WizardCoder: Empowering Code Large Language Models with Evol-Instruct

Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma, Qingwei Lin, Daxin Jiang

2023-06-14 | Code Generation | HumanEval

Abstract

Code Large Language Models (Code LLMs), such as StarCoder, have demonstrated exceptional performance in code-related tasks. However, most existing models are solely pre-trained on extensive raw code data without instruction fine-tuning. In this paper, we introduce WizardCoder, which empowers Code LLMs with complex instruction fine-tuning, by adapting the Evol-Instruct method to the domain of code. Through comprehensive experiments on four prominent code generation benchmarks, namely HumanEval, HumanEval+, MBPP, and DS-1000, we unveil the exceptional capabilities of our model. It surpasses all other open-source Code LLMs by a substantial margin. Moreover, our model even outperforms the largest closed LLMs, Anthropic's Claude and Google's Bard, on HumanEval and HumanEval+. Our code, model weights, and data are public at https://github.com/nlpxucan/WizardLM

Results

Task	Dataset	Metric	Value	Model
Code Generation	CodeContests	Test Set pass@1	1.11	WizardCoder-15B
Code Generation	CodeContests	Test Set pass@5	3.18	WizardCoder-15B
Code Generation	CodeContests	Val Set pass@1	1.98	WizardCoder-15B
Code Generation	CodeContests	Val Set pass@5	3.27	WizardCoder-15B
Code Generation	MBPP	Accuracy	51.8	WizardCoder-15B
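
The pass@1 and pass@5 rows above report the percentage of problems for which at least one of k sampled programs passes all tests. This page does not say how the values were computed. A common choice is the unbiased estimator introduced alongside HumanEval: sample n >= k programs per problem, count the c that pass, and average 1 - C(n-c, k) / C(n, k) over problems. The sketch below assumes that estimator; the function names and the sample counts are illustrative and are not taken from the WizardCoder code.

```python
from math import comb
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for a single problem.

    n: number of programs sampled for the problem
    c: number of those programs that passed every test
    k: sample budget being evaluated (1 <= k <= n)
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Fewer than k failing samples: every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    # 1 minus the probability that a random size-k subset is all failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: Iterable[Tuple[int, int]], k: int) -> float:
    """Mean pass@k over problems, as a percentage.

    results: one (n, c) pair per problem.
    """
    scores = [pass_at_k(n, c, k) for n, c in results]
    if not scores:
        raise ValueError("results must contain at least one problem")
    return 100.0 * sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical per-problem counts (n sampled, c correct), for illustration only.
    demo = [(20, 1), (20, 0), (20, 5)]
    print(f"pass@1 = {benchmark_pass_at_k(demo, 1):.2f}")
    print(f"pass@5 = {benchmark_pass_at_k(demo, 5):.2f}")
```

With k = 1 the estimate reduces to c / n, the fraction of samples that pass, which is why pass@1 is often read as single-attempt accuracy.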

Related Papers

CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning (2025-07-18)
Towards Formal Verification of LLM-Generated Code from Natural Language Prompts (2025-07-17)
MERA Code: A Unified Framework for Evaluating Code Generation Across Tasks (2025-07-16)
Scaling Up RL: Unlocking Diverse Reasoning in LLMs via Prolonged Training (2025-07-16)
The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs (2025-07-15)
Kodezi Chronos: A Debugging-First Language Model for Repository-Scale, Memory-Driven Code Understanding (2025-07-14)
CodeJudgeBench: Benchmarking LLM-as-a-Judge for Coding Tasks (2025-07-14)
CodeAssistBench (CAB): Dataset & Benchmarking for Multi-turn Chat-Based Code Assistance (2025-07-14)