Metric: Prompt-level strict-accuracy (higher is better)
| # | Model | Prompt-level strict-accuracy | Extra Data | Paper | Date | Code |
|---|---|---|---|---|---|---|
| 1 | AutoIF (Llama3 70B) | 80.2 | No | Self-play with Execution Feedback: Improving Ins... | 2024-06-19 | Code |
| 2 | AutoIF (Qwen2 72B) | 80.2 | No | Self-play with Execution Feedback: Improving Ins... | 2024-06-19 | Code |
| 3 | GPT-4 | 76.89 | No | Instruction-Following Evaluation for Large Langu... | 2023-11-14 | Code |
| 4 | PaLM 2 S | 43.07 | No | Instruction-Following Evaluation for Large Langu... | 2023-11-14 | Code |