Metric: 2k (higher is better)
| # | Model | 2k | Extra Data | Paper | Date | Code |
|---|---|---|---|---|---|---|
| 1 | GPT-4-Turbo-1106 | 73.5 | No | GPT-4 Technical Report | 2023-03-15 | Code |
| 2 | GPT-4-Turbo-0125 | 73.5 | No | GPT-4 Technical Report | 2023-03-15 | Code |
| 3 | InternLM2-7b | 49.5 | No | InternLM2 Technical Report | 2024-03-26 | Code |
| 4 | GPT-3.5-Turbo-1106 | 48.5 | No | - | - | - |
| 5 | Claude-2 | 43.5 | No | - | - | - |
| 6 | Vicuna-13b-v1.5-16k | 29.2 | No | Judging LLM-as-a-Judge with MT-Bench and Chatbot... | 2023-06-09 | Code |
| 7 | ChatGLM3-6b-32k | 18.8 | No | GLM-130B: An Open Bilingual Pre-trained Model | 2022-10-05 | Code |
| 8 | Vicuna-7b-v1.5-16k | 11.1 | No | Judging LLM-as-a-Judge with MT-Bench and Chatbot... | 2023-06-09 | Code |
| 9 | ChatGLM2-6b-32k | 10.9 | No | GLM-130B: An Open Bilingual Pre-trained Model | 2022-10-05 | Code |
| 10 | LongChat-7b-v1.5-32k | 10.7 | No | Judging LLM-as-a-Judge with MT-Bench and Chatbot... | 2023-06-09 | Code |