Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Self-play with Execution Feedback: Improving Instruction-following Capabilities of Large Language Models

Guanting Dong, Keming Lu, Chengpeng Li, Tingyu Xia, Bowen Yu, Chang Zhou, Jingren Zhou

2024-06-19 · Instruction Following
Paper · PDF · Code (official)

Abstract

One core capability of large language models (LLMs) is to follow natural language instructions. However, the issue of automatically constructing high-quality training data to enhance the complex instruction-following abilities of LLMs without manual annotation remains unresolved. In this paper, we introduce AutoIF, the first scalable and reliable method for automatically generating instruction-following training data. AutoIF transforms the validation of instruction-following data quality into code verification, requiring LLMs to generate instructions, the corresponding code to check the correctness of the instruction responses, and unit test samples to verify the code's correctness. Then, execution feedback-based rejection sampling can generate data for Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) training. AutoIF achieves significant improvements across three training algorithms, SFT, Offline DPO, and Online DPO, when applied to the top open-source LLMs, Qwen2 and LLaMA3, in self-alignment and strong-to-weak distillation settings. Our code is publicly available at https://github.com/QwenLM/AutoIF.
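The pipeline described in the abstract can be sketched in a few lines: the LLM produces an instruction, a verification function for responses to that instruction, and unit tests that validate the verification function; responses that pass the executed check are kept for SFT/RLHF data. The sketch below is illustrative only, assuming a hypothetical layout in which the generated verifier defines a `check(response) -> bool` function; none of these names come from the AutoIF codebase.

```python
# Minimal sketch of AutoIF-style execution-feedback rejection sampling.
# All names here are illustrative, not taken from the paper's released code.

def make_verifier(verifier_source: str):
    """Compile LLM-generated verification code into a callable.

    Assumes the generated code defines `check(response) -> bool`.
    """
    namespace = {}
    exec(verifier_source, namespace)
    return namespace["check"]

def verifier_passes_unit_tests(check, test_cases) -> bool:
    """Validate the verifier itself against LLM-generated unit tests."""
    try:
        return all(check(resp) == expected for resp, expected in test_cases)
    except Exception:
        return False

def rejection_sample(responses, check):
    """Keep only responses whose execution feedback is positive."""
    kept = []
    for resp in responses:
        try:
            if check(resp):
                kept.append(resp)
        except Exception:
            pass  # a crashing verifier counts as negative feedback
    return kept

# Toy instruction: "answer in exactly three words".
verifier_code = "def check(response):\n    return len(response.split()) == 3"
check = make_verifier(verifier_code)

# LLM-generated unit tests confirm the verifier behaves as intended.
tests = [("one two three", True), ("too short", False)]
assert verifier_passes_unit_tests(check, tests)

candidates = ["I love cats", "This sentence has five words"]
print(rejection_sample(candidates, check))  # ['I love cats']
```

In the paper's framing, the surviving (instruction, response) pairs feed SFT directly, while kept/rejected pairs provide the preference signal for Offline and Online DPO.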

Results

Task | Dataset | Metric | Value | Model
Instruction Following | IFEval | Inst-level loose-accuracy | 90.4 | AutoIF (Llama3 70B)
Instruction Following | IFEval | Inst-level strict-accuracy | 86.7 | AutoIF (Llama3 70B)
Instruction Following | IFEval | Prompt-level loose-accuracy | 85.6 | AutoIF (Llama3 70B)
Instruction Following | IFEval | Prompt-level strict-accuracy | 80.2 | AutoIF (Llama3 70B)
Instruction Following | IFEval | Inst-level loose-accuracy | 88.0 | AutoIF (Qwen2 72B)
Instruction Following | IFEval | Inst-level strict-accuracy | 86.1 | AutoIF (Qwen2 72B)
Instruction Following | IFEval | Prompt-level loose-accuracy | 82.3 | AutoIF (Qwen2 72B)
Instruction Following | IFEval | Prompt-level strict-accuracy | 80.2 | AutoIF (Qwen2 72B)

Related Papers

AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning (2025-07-17)
How Many Instructions Can LLMs Follow at Once? (2025-07-15)
DrafterBench: Benchmarking Large Language Models for Tasks Automation in Civil Engineering (2025-07-15)
Multilingual Multimodal Software Developer for Code Generation (2025-07-11)
TuneShield: Mitigating Toxicity in Conversational AI while Fine-tuning on Untrusted Data (2025-07-08)
DeSTA2.5-Audio: Toward General-Purpose Large Audio Language Model with Self-Generated Cross-Modal Alignment (2025-07-03)
Meta SecAlign: A Secure Foundation LLM Against Prompt Injection Attacks (2025-07-03)
Kwai Keye-VL Technical Report (2025-07-02)