Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Self-play with Execution Feedback: Improving Instruction-following Capabilities of Large Language Models

Guanting Dong, Keming Lu, Chengpeng Li, Tingyu Xia, Bowen Yu, Chang Zhou, Jingren Zhou

2024-06-19 · Instruction Following
Paper · PDF · Code (official)

Abstract

One core capability of large language models (LLMs) is to follow natural language instructions. However, the issue of automatically constructing high-quality training data to enhance the complex instruction-following abilities of LLMs without manual annotation remains unresolved. In this paper, we introduce AutoIF, the first scalable and reliable method for automatically generating instruction-following training data. AutoIF transforms the validation of instruction-following data quality into code verification, requiring LLMs to generate instructions, the corresponding code to check the correctness of the instruction responses, and unit test samples to verify the code's correctness. Then, execution feedback-based rejection sampling can generate data for Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) training. AutoIF achieves significant improvements across three training algorithms, SFT, Offline DPO, and Online DPO, when applied to the top open-source LLMs, Qwen2 and LLaMA3, in self-alignment and strong-to-weak distillation settings. Our code is publicly available at https://github.com/QwenLM/AutoIF.
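The pipeline described in the abstract can be sketched in a few lines: the LLM produces an instruction, a verification function for responses to that instruction, and unit tests that validate the verification function; responses that pass the executed check are kept for SFT/RLHF data. The sketch below is illustrative only, assuming a hypothetical layout in which the generated verifier defines a `check(response) -> bool` function; none of these names come from the AutoIF codebase.

```python
# Minimal sketch of AutoIF-style execution-feedback rejection sampling.
# All names here are illustrative, not taken from the paper's released code.

def make_verifier(verifier_source: str):
    """Compile LLM-generated verification code into a callable.

    Assumes the generated code defines `check(response) -> bool`.
    """
    namespace = {}
    exec(verifier_source, namespace)
    return namespace["check"]

def verifier_passes_unit_tests(check, test_cases) -> bool:
    """Validate the verifier itself against LLM-generated unit tests."""
    try:
        return all(check(resp) == expected for resp, expected in test_cases)
    except Exception:
        return False

def rejection_sample(responses, check):
    """Keep only responses whose execution feedback is positive."""
    kept = []
    for resp in responses:
        try:
            if check(resp):
                kept.append(resp)
        except Exception:
            pass  # a crashing verifier counts as negative feedback
    return kept

# Toy instruction: "answer in exactly three words".
verifier_code = "def check(response):\n    return len(response.split()) == 3"
check = make_verifier(verifier_code)

# LLM-generated unit tests confirm the verifier behaves as intended.
tests = [("one two three", True), ("too short", False)]
assert verifier_passes_unit_tests(check, tests)

candidates = ["I love cats", "This sentence has five words"]
print(rejection_sample(candidates, check))  # ['I love cats']
```

In the paper's framing, the surviving (instruction, response) pairs feed SFT directly, while kept/rejected pairs provide the preference signal for Offline and Online DPO.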

Results

Task | Dataset | Metric | Value | Model
Instruction Following | IFEval | Inst-level loose-accuracy | 90.4 | AutoIF (Llama3 70B)
Instruction Following | IFEval | Inst-level strict-accuracy | 86.7 | AutoIF (Llama3 70B)
Instruction Following | IFEval | Prompt-level loose-accuracy | 85.6 | AutoIF (Llama3 70B)
Instruction Following | IFEval | Prompt-level strict-accuracy | 80.2 | AutoIF (Llama3 70B)
Instruction Following | IFEval | Inst-level loose-accuracy | 88.0 | AutoIF (Qwen2 72B)
Instruction Following | IFEval | Inst-level strict-accuracy | 86.1 | AutoIF (Qwen2 72B)
Instruction Following | IFEval | Prompt-level loose-accuracy | 82.3 | AutoIF (Qwen2 72B)
Instruction Following | IFEval | Prompt-level strict-accuracy | 80.2 | AutoIF (Qwen2 72B)

Related Papers

AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning (2025-07-17)
How Many Instructions Can LLMs Follow at Once? (2025-07-15)
DrafterBench: Benchmarking Large Language Models for Tasks Automation in Civil Engineering (2025-07-15)
Multilingual Multimodal Software Developer for Code Generation (2025-07-11)
TuneShield: Mitigating Toxicity in Conversational AI while Fine-tuning on Untrusted Data (2025-07-08)
DeSTA2.5-Audio: Toward General-Purpose Large Audio Language Model with Self-Generated Cross-Modal Alignment (2025-07-03)
Meta SecAlign: A Secure Foundation LLM Against Prompt Injection Attacks (2025-07-03)
Kwai Keye-VL Technical Report (2025-07-02)