Arena-Hard-Auto
The Arena-Hard-Auto benchmark is an automatic evaluation tool for instruction-tuned large language models (LLMs)¹. It was developed to provide a cheaper and faster approximation of human preference¹.
Here are some key features of the Arena-Hard-Auto benchmark:
- It contains 500 challenging user queries¹.
- It uses GPT-4-Turbo as a judge to compare each model's responses against a baseline model (default: GPT-4-0314)¹; a conceptual sketch of this pairwise judging follows the list.
- It employs this automatic judge as a cheaper and faster approximation of human preference¹.
- It has the highest correlation with Chatbot Arena and the best separability among popular open-ended LLM benchmarks¹.
- If you are curious to see how well your model might perform on Chatbot Arena, Arena-Hard-Auto is recommended¹.
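The sketch below illustrates the pairwise-judging idea only: a judge model compares a candidate answer against a baseline answer for the same query. The prompt wording, label parsing, and model identifiers are assumptions for illustration, not the repository's actual judge prompt or pipeline; the candidate model name is hypothetical.

```python
# Conceptual sketch of pairwise judging against a baseline (not the repo's code).
# Assumes the openai Python SDK (>=1.0) and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

JUDGE_MODEL = "gpt-4-1106-preview"   # GPT-4-Turbo acting as the judge
BASELINE_MODEL = "gpt-4-0314"        # default baseline per the benchmark docs

def generate_answer(model: str, query: str) -> str:
    """Get one answer to a benchmark query from the given model."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": query}],
    )
    return resp.choices[0].message.content

def judge_pair(query: str, answer_a: str, answer_b: str) -> str:
    """Ask the judge which of two anonymized answers better serves the query."""
    prompt = (
        "You are an impartial judge. Given the user question and two answers, "
        "reply with exactly 'A' or 'B' for the better answer, or 'TIE'.\n\n"
        f"Question:\n{query}\n\nAnswer A:\n{answer_a}\n\nAnswer B:\n{answer_b}"
    )
    resp = client.chat.completions.create(
        model=JUDGE_MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()

if __name__ == "__main__":
    query = "Explain the difference between a process and a thread."
    candidate_answer = generate_answer("my-candidate-model", query)  # hypothetical name
    baseline_answer = generate_answer(BASELINE_MODEL, query)
    print(judge_pair(query, candidate_answer, baseline_answer))
```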
The benchmark is built from live data in Chatbot Arena, a popular crowd-sourced platform for LLM evaluations⁴. Compared with other benchmarks, it offers significantly stronger separability between models and tighter confidence intervals².
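To make "separability" and "tighter confidence intervals" concrete, the following is a hedged sketch of bootstrapping a win-rate confidence interval from per-query judgments: two models are separable when their intervals do not overlap. This shows the statistical idea only; the benchmark's own aggregation may differ in detail, and the outcome data below is synthetic.

```python
# Bootstrap a confidence interval for a model's win rate vs. the baseline.
import random

def bootstrap_win_rate_ci(outcomes, n_resamples=1000, alpha=0.05):
    """outcomes: per-query results vs. the baseline (1 = win, 0.5 = tie, 0 = loss)."""
    means = []
    for _ in range(n_resamples):
        sample = [random.choice(outcomes) for _ in outcomes]  # resample with replacement
        means.append(sum(sample) / len(sample))
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Example: 500 synthetic per-query judgments for one candidate model.
outcomes = [random.choice([1, 1, 0.5, 0]) for _ in range(500)]
print(bootstrap_win_rate_ci(outcomes))
```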
(1) GitHub - lm-sys/arena-hard-auto: Arena-Hard-Auto: An automatic LLM …. https://github.com/lm-sys/arena-hard-auto
(2) Arena Hard – UC Berkeley Sky Computing. https://sky.cs.berkeley.edu/project/arena-hard/
(3) From Live Data to High-Quality Benchmarks: The Arena-Hard Pipeline …. https://lmsys.org/blog/2024-04-19-arena-hard/
(4) GitHub - lm-sys/arena-hard-auto: Arena-Hard-Auto: An automatic LLM …. https://github.com/lm-sys/arena-hard-auto
(5) Arena-Hard: An Open-Source, High-Quality Benchmark for Evaluating Large Models (CSDN blog, in Chinese). https://blog.csdn.net/weixin_57291105/article/details/138132998