AutoIF (Qwen2 72B)

Reported on 4 benchmarks across 1 task · 1 paper

Note: results are matched by exact model name. Different papers may use the same name for different model variants.

Natural Language Processing · 4 results

Instruction Following on IFEval
Inst-level loose-accuracy · 2024-06-19
88
best: 90.4 (AutoIF (Llama3 70B))
Self-play with Execution Feedback: Improving Instruction-following Capabilities of Large Language Models arXiv:2406.13542
Instruction Following on IFEval
Inst-level strict-accuracy · 2024-06-19
86.1
best: 86.7 (AutoIF (Llama3 70B))
Self-play with Execution Feedback: Improving Instruction-following Capabilities of Large Language Models arXiv:2406.13542
Instruction Following on IFEval
Prompt-level loose-accuracy · 2024-06-19
82.3
best: 85.6 (AutoIF (Llama3 70B))
Self-play with Execution Feedback: Improving Instruction-following Capabilities of Large Language Models arXiv:2406.13542
Instruction Following on IFEval
Prompt-level strict-accuracy · 2024-06-19
80.2
Self-play with Execution Feedback: Improving Instruction-following Capabilities of Large Language Models arXiv:2406.13542