Arena-Hard
The Arena-Hard benchmark is a high-quality benchmark for large language models (LLMs) developed by LMSYS Org¹. It was designed to address the limitations of traditional benchmarks, which are often static or close-ended¹.
Key features of the Arena-Hard benchmark include¹²:
- Robustly separates model capability: It can reliably distinguish the capabilities of different models.
- Reflects human preference in real-world use cases: The benchmark score has a high agreement with human preference.
- Frequently updates to avoid over-fitting or test set leakage: It uses new, unseen prompts to ensure the benchmark remains challenging and relevant.
The Arena-Hard benchmark is built from live data in the Chatbot Arena, a crowd-sourced platform for LLM evaluations¹. It contains 500 challenging user queries². The benchmark uses GPT-4-Turbo as a judge to compare the responses of different models against a baseline model².
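To make the pairwise-judging setup concrete, here is a minimal sketch of how a candidate model's answers could be scored against a baseline by a judge model. This is not the official arena-hard code: the prompt wording, the verdict labels, the `judge_battles` helper, and the score mapping are all illustrative assumptions; in practice the judge would be GPT-4-Turbo and the real pipeline also swaps answer positions to reduce position bias.

```python
# Minimal sketch (not the official arena-hard implementation) of judging a
# candidate model against a fixed baseline, one verdict per question.
from typing import Callable, List

# Illustrative judge prompt; the actual Arena-Hard prompt differs in detail.
JUDGE_PROMPT = (
    "You are an impartial judge. Compare the two assistant answers to the "
    "user question and reply with exactly one of: A>>B, A>B, A=B, B>A, B>>A.\n\n"
    "Question: {question}\n\nAnswer A: {answer_a}\n\nAnswer B: {answer_b}"
)

# Illustrative mapping from verdict label to a score for the candidate (answer A).
VERDICT_SCORE = {"A>>B": 1.0, "A>B": 0.75, "A=B": 0.5, "B>A": 0.25, "B>>A": 0.0}

def judge_battles(
    questions: List[str],
    baseline_answers: List[str],
    candidate_answers: List[str],
    judge: Callable[[str], str],
) -> float:
    """Return the candidate model's mean score against the baseline (0..1)."""
    scores = []
    for q, base, cand in zip(questions, baseline_answers, candidate_answers):
        # Candidate is answer A, baseline is answer B (a single ordering here;
        # a real pipeline would also judge the swapped ordering).
        prompt = JUDGE_PROMPT.format(question=q, answer_a=cand, answer_b=base)
        verdict = judge(prompt)
        scores.append(VERDICT_SCORE.get(verdict, 0.5))  # unknown label -> tie
    return sum(scores) / len(scores)

if __name__ == "__main__":
    # Dummy judge so the sketch runs without an API key; in practice this
    # callable would wrap a call to the judge model (e.g. GPT-4-Turbo).
    dummy_judge = lambda prompt: "A>B"
    print(judge_battles(["What is 2+2?"], ["4"], ["The answer is 4."], dummy_judge))
```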
The Arena-Hard benchmark has been found to offer significantly stronger separability than other benchmarks, with tighter confidence intervals¹. It also shows high agreement (89.1%) with the human preference ranking from Chatbot Arena¹.
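"Separability" here refers to whether two models' score intervals overlap. The sketch below shows one common way to compute such intervals with a percentile bootstrap over per-question scores; the function name, the 95% level, and the synthetic scores are assumptions for illustration, not the benchmark's official analysis code.

```python
# A minimal sketch of bootstrapped confidence intervals over per-question
# scores, the kind of interval behind a separability comparison.
import random
from typing import List, Tuple

def bootstrap_ci(scores: List[float], n_rounds: int = 1000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean score."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_rounds):
        # Resample the per-question scores with replacement.
        sample = [rng.choice(scores) for _ in scores]
        means.append(sum(sample) / len(sample))
    means.sort()
    lo = means[int((alpha / 2) * n_rounds)]
    hi = means[int((1 - alpha / 2) * n_rounds) - 1]
    return lo, hi

if __name__ == "__main__":
    rng = random.Random(1)
    # Synthetic per-question scores for two hypothetical models (500 questions).
    model_a = [0.55 + 0.3 * rng.random() for _ in range(500)]
    model_b = [0.35 + 0.3 * rng.random() for _ in range(500)]
    ci_a, ci_b = bootstrap_ci(model_a), bootstrap_ci(model_b)
    # Two models are cleanly separated if their intervals do not overlap.
    separated = ci_a[0] > ci_b[1] or ci_b[0] > ci_a[1]
    print(ci_a, ci_b, "separated:", separated)
```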
(1) From Live Data to High-Quality Benchmarks: The Arena-Hard Pipeline. https://lmsys.org/blog/2024-04-19-arena-hard/
(2) GitHub - lm-sys/arena-hard: Arena-Hard benchmark. https://github.com/lm-sys/arena-hard