Metric: Accuracy (Acc, %; higher is better)
| # | Model | Acc | Extra Data | Paper | Date | Code |
|---|---|---|---|---|---|---|
| 1 | Xolver | 94.4 | No | Xolver: Multi-Agent Reasoning with Holistic Expe... | 2025-06-17 | Code |
| 2 | DeepSeek-R1 | 79.8 | No | DeepSeek-R1: Incentivizing Reasoning Capability ... | 2025-01-22 | Code |
| 3 | OpenAI-o1 | 74.4 | No | - | - | - |
| 4 | OpenAI-o1-mini | 70.0 | No | - | - | - |
| 5 | Search-o1 | 56.7 | No | Search-o1: Agentic Search-Enhanced Large Reasoni... | 2025-01-09 | Code |
| 6 | s1-32B | 56.7 | No | s1: Simple test-time scaling | 2025-01-31 | Code |
| 7 | OpenAI-o1-preview | 44.6 | No | - | - | - |
| 8 | Qwen2.5-72B-Instruct | 23.3 | No | Qwen2.5 Technical Report | 2024-12-19 | Code |
| 9 | Claude3.5-Sonnet | 16.0 | No | - | - | - |