Metric: pass@1 (higher is better)
| # | Model | pass@1 | Extra Data | Paper | Date | Code |
|---|---|---|---|---|---|---|
| 1 | o1-preview | 0.952 | No | A Case Study of Web App Coding with OpenAI Reaso... | 2024-09-19 | Code |
| 2 | o1-mini | 0.939 | No | A Case Study of Web App Coding with OpenAI Reaso... | 2024-09-19 | Code |
| 3 | gpt-4o-2024-08-06 | 0.885 | No | Insights from Benchmarking Frontier Language Mod... | 2024-09-08 | Code |
| 4 | claude-3.5-sonnet | 0.8808 | No | Insights from Benchmarking Frontier Language Mod... | 2024-09-08 | Code |
| 5 | deepseek-v2.5 | 0.834 | No | A Case Study of Web App Coding with OpenAI Reaso... | 2024-09-19 | Code |
| 6 | mistral-large-2 | 0.7804 | No | Insights from Benchmarking Frontier Language Mod... | 2024-09-08 | Code |
| 7 | deepseek-coder-v2-instruct | 0.7002 | No | Insights from Benchmarking Frontier Language Mod... | 2024-09-08 | Code |
| 8 | llama-v3p1-405b-instruct | 0.302 | No | Insights from Benchmarking Frontier Language Mod... | 2024-09-08 | Code |
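
The table does not say how each pass@1 score was computed, for example how many samples were drawn per problem. As a rough sketch only, the widely used unbiased pass@k estimator can be written as below, assuming `n` samples per problem of which `c` pass all tests. The function names `pass_at_k` and `mean_pass_at_k` are illustrative and do not come from the benchmark's code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int = 1) -> float:
    """Estimate pass@k for one problem from n samples, c of which passed.

    Uses the standard unbiased estimator 1 - C(n - c, k) / C(n, k).
    Assumption: the leaderboard's scores may or may not use this exact form.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: any draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int = 1) -> float:
    """Average the per-problem estimates over (n, c) pairs."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With a single sample per problem (`n = 1`, `k = 1`), this reduces to the fraction of problems whose one generated solution passes its tests, which is how a score such as 0.952 would read as "95.2% of problems solved on the first attempt" under that assumption.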