Arena-Hard-Auto
The Arena-Hard-Auto benchmark is an automatic evaluation tool for instruction-tuned large language models (LLMs)¹. It was developed to provide a cheaper and faster approximation of human preference¹.
Here are some key features of the Arena-Hard-Auto benchmark:
- It contains 500 challenging user queries¹.
- It uses GPT-4-Turbo as a judge to compare each model's responses against a baseline model (default: GPT-4-0314)¹; a conceptual sketch of this pairwise judging follows the list.
- It employs this automatic judge as a cheaper and faster approximation of human preference¹.
- It has the highest correlation with Chatbot Arena and the best separability among popular open-ended LLM benchmarks¹.
- If you are curious to see how well your model might perform on Chatbot Arena, Arena-Hard-Auto is recommended¹.
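The sketch below illustrates the pairwise-judging idea only: a judge model compares a candidate answer against a baseline answer for the same query. The prompt wording, label parsing, and model identifiers are assumptions for illustration, not the repository's actual judge prompt or pipeline; the candidate model name is hypothetical.

```python
# Conceptual sketch of pairwise judging against a baseline (not the repo's code).
# Assumes the openai Python SDK (>=1.0) and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

JUDGE_MODEL = "gpt-4-1106-preview"   # GPT-4-Turbo acting as the judge
BASELINE_MODEL = "gpt-4-0314"        # default baseline per the benchmark docs

def generate_answer(model: str, query: str) -> str:
    """Get one answer to a benchmark query from the given model."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": query}],
    )
    return resp.choices[0].message.content

def judge_pair(query: str, answer_a: str, answer_b: str) -> str:
    """Ask the judge which of two anonymized answers better serves the query."""
    prompt = (
        "You are an impartial judge. Given the user question and two answers, "
        "reply with exactly 'A' or 'B' for the better answer, or 'TIE'.\n\n"
        f"Question:\n{query}\n\nAnswer A:\n{answer_a}\n\nAnswer B:\n{answer_b}"
    )
    resp = client.chat.completions.create(
        model=JUDGE_MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()

if __name__ == "__main__":
    query = "Explain the difference between a process and a thread."
    candidate_answer = generate_answer("my-candidate-model", query)  # hypothetical name
    baseline_answer = generate_answer(BASELINE_MODEL, query)
    print(judge_pair(query, candidate_answer, baseline_answer))
```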
The benchmark is built from live data in Chatbot Arena, a popular crowd-sourced platform for LLM evaluations⁴. Compared with other benchmarks, it offers significantly stronger separability between models and tighter confidence intervals².
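To make "separability" and "tighter confidence intervals" concrete, the following is a hedged sketch of bootstrapping a win-rate confidence interval from per-query judgments: two models are separable when their intervals do not overlap. This shows the statistical idea only; the benchmark's own aggregation may differ in detail, and the outcome data below is synthetic.

```python
# Bootstrap a confidence interval for a model's win rate vs. the baseline.
import random

def bootstrap_win_rate_ci(outcomes, n_resamples=1000, alpha=0.05):
    """outcomes: per-query results vs. the baseline (1 = win, 0.5 = tie, 0 = loss)."""
    means = []
    for _ in range(n_resamples):
        sample = [random.choice(outcomes) for _ in outcomes]  # resample with replacement
        means.append(sum(sample) / len(sample))
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Example: 500 synthetic per-query judgments for one candidate model.
outcomes = [random.choice([1, 1, 0.5, 0]) for _ in range(500)]
print(bootstrap_win_rate_ci(outcomes))
```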
(1) GitHub - lm-sys/arena-hard-auto: Arena-Hard-Auto: An automatic LLM …. https://github.com/lm-sys/arena-hard-auto
(2) Arena Hard – UC Berkeley Sky Computing. https://sky.cs.berkeley.edu/project/arena-hard/
(3) From Live Data to High-Quality Benchmarks: The Arena-Hard Pipeline …. https://lmsys.org/blog/2024-04-19-arena-hard/
(4) GitHub - lm-sys/arena-hard-auto: Arena-Hard-Auto: An automatic LLM …. https://github.com/lm-sys/arena-hard-auto
(5) Arena-Hard: An Open-Source, High-Quality Benchmark for Evaluating Large Models (CSDN blog, in Chinese). https://blog.csdn.net/weixin_57291105/article/details/138132998