Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Code Generation on MBPP

Metric: Accuracy (higher is better)

Results

| # | Model | Accuracy | Extra Data | Paper | Date | Code |
|---|-------|----------|------------|-------|------|------|
| 1 | EG-CFG (DeepSeek-V3-0324) | 96.6 | No | Execution Guided Line-by-Line Code Generation | 2025-06-12 | Code |
| 2 | QualityFlow (Sonnet-3.5) | 94.2 | No | QualityFlow: An Agentic Workflow for Program Syn... | 2025-01-20 | - |
| 3 | o1-mini + MapCoder (Hamming.ai) | 93.2 | Yes | MapCoder: Multi-Agent Code Generation for Compet... | 2024-05-18 | Code |
| 4 | MGDebugger (DeepSeek-V3-0324) | 92.4 | No | From Code to Correctness: Closing the Last Mile ... | 2024-10-02 | Code |
| 5 | GPT-4 + AgentCoder | 91.8 | No | AgentCoder: Multi-Agent-based Code Generation wi... | 2023-12-20 | Code |
| 6 | CodeSim (GPT4o) | 90.7 | No | CODESIM: Multi-Agent Code Generation and Problem... | 2025-02-08 | Code |
| 7 | Jiutian-大模型 | 90 | No | - | - | - |
| 8 | GPT-3.5 Turbo (ChatGPT) + AgentCoder | 89.9 | No | AgentCoder: Multi-Agent-based Code Generation wi... | 2023-12-20 | Code |
| 9 | MapCoder (GPT-4o) | 89.7 | No | MapCoder: Multi-Agent Code Generation for Compet... | 2024-05-18 | Code |
| 10 | GPT-4 (ChatGPT Plus) | 87.5 | No | How Does Naming Affect LLMs on Code Analysis Tas... | 2023-07-24 | - |
| 11 | Claude 3 Opus | 86.4 | No | - | - | - |
| 12 | LPW (GPT-4o) | 84.8 | No | Planning-Driven Programming: A Large Language Mo... | 2024-11-21 | Code |
| 13 | AFlow(GPT-4o-mini) | 83.4 | No | AFlow: Automating Agentic Workflow Generation | 2024-10-14 | Code |
| 14 | GPT-3.5 Turbo (ChatGPT) | 83.2 | No | How Does Naming Affect LLMs on Code Analysis Tas... | 2023-07-24 | - |
| 15 | EG-CFG (DeepSeek Coder 1.3b Instruct) | 83.2 | No | Execution Guided Line-by-Line Code Generation | 2025-06-12 | Code |
| 16 | MapCoder (GPT-4) | 83.1 | No | MapCoder: Multi-Agent Code Generation for Compet... | 2024-05-18 | Code |
| 17 | o1-mini + Language Agent Tree Search (Hamming.ai) | 82.3 | No | Language Agent Tree Search Unifies Reasoning Act... | 2023-10-06 | Code |
| 18 | GPT-4 (Bing Chat) | 82 | No | How Does Naming Affect LLMs on Code Analysis Tas... | 2023-07-24 | - |
| 19 | GPT-3.5 Turbo + Language Agent Tree Search | 81.1 | No | Language Agent Tree Search Unifies Reasoning Act... | 2023-10-06 | Code |
| 20 | MGDebugger (CodeQwen1.5) | 80.8 | No | From Code to Correctness: Closing the Last Mile ... | 2024-10-02 | Code |
| 21 | Claude 3 Haiku | 80.4 | No | - | - | - |
| 22 | GPT-4 (Self-Debugging with unit tests + trace) | 80.2 | No | Teaching Large Language Models to Self-Debug | 2023-04-11 | Code |
| 23 | GPT-4 (few-shot) | 80 | Yes | DeepSeek-Coder: When the Large Language Model Me... | 2024-01-25 | Code |
| 24 | Claude 3 Sonnet | 79.4 | No | - | - | - |
| 25 | Bard (PaLM 2/chat-bison-001) | 76.2 | No | How Does Naming Affect LLMs on Code Analysis Tas... | 2023-07-24 | - |
| 26 | GPT-3.5 Turbo (Self-Debugging with unit tests + trace) | 72.8 | No | Teaching Large Language Models to Self-Debug | 2023-04-11 | Code |
| 27 | Claude | 71.4 | No | How Does Naming Affect LLMs on Code Analysis Tas... | 2023-07-24 | - |
| 28 | code-davinci-002 175B (Self-Debugging with unit tests + trace) | 70.8 | No | Teaching Large Language Models to Self-Debug | 2023-04-11 | Code |
| 29 | GPT-3.5 Turbo (few-shot) | 70.8 | Yes | DeepSeek-Coder: When the Large Language Model Me... | 2024-01-25 | Code |
| 30 | DeepSeek-Coder-Instruct 33B (few-shot) | 70 | No | DeepSeek-Coder: When the Large Language Model Me... | 2024-01-25 | Code |
| 31 | GPT-3.5 Turbo + INTERVENOR | 69.8 | No | INTERVENOR: Prompting the Coding Ability of Larg... | 2023-11-16 | Code |
| 32 | code-davinci-002 175B + LEVER | 68.9 | No | LEVER: Learning to Verify Language-to-Code Gener... | 2023-02-16 | Code |
| 33 | code-davinci-002 175B + CodeT | 67.7 | No | CodeT: Code Generation with Generated Tests | 2022-07-21 | Code |
| 34 | GPT-3.5 Turbo (3-shot) | 67.6 | Yes | Teaching Large Language Models to Self-Debug | 2023-04-11 | Code |
| 35 | code-davinci-002 175B + Reviewer | 66.9 | No | Coder Reviewer Reranking for Code Generation | 2022-11-29 | Code |
| 36 | code-davinci-002 175B + Coder-Reviewer | 66.4 | No | Coder Reviewer Reranking for Code Generation | 2022-11-29 | Code |
| 37 | StarCoder2-15B | 66.2 | No | StarCoder 2 and The Stack v2: The Next Generation | 2024-02-29 | Code |
| 38 | DeepSeek-Coder-Base 33B (few-shot) | 66 | No | DeepSeek-Coder: When the Large Language Model Me... | 2024-01-25 | Code |
| 39 | Code Llama - Python 70B (3-shot) | 65.5 | Yes | Code Llama: Open Foundation Models for Code | 2023-08-24 | Code |
| 40 | DeepSeek-Coder-Instruct 6.7B (few-shot) | 65.4 | No | DeepSeek-Coder: When the Large Language Model Me... | 2024-01-25 | Code |
| 41 | code-davinci-002 175B + MBR-Exec | 63 | No | Coder Reviewer Reranking for Code Generation | 2022-11-29 | Code |
| 42 | Code Llama 70B (3-shot) | 62.4 | No | Code Llama: Open Foundation Models for Code | 2023-08-24 | Code |
| 43 | Code Llama - Instruct 70B (3-shot) | 62.2 | No | Code Llama: Open Foundation Models for Code | 2023-08-24 | Code |
| 44 | code-davinci-001 175B + CodeT | 61.9 | No | CodeT: Code Generation with Generated Tests | 2022-07-21 | Code |
| 45 | code-davinci-002 175B (3-shot) | 61.4 | Yes | Teaching Large Language Models to Self-Debug | 2023-04-11 | Code |
| 46 | Unnatural Code Llama 34B (3-shot) | 61.2 | No | Code Llama: Open Foundation Models for Code | 2023-08-24 | Code |
| 47 | Mixtral 8x7B (3-shot) | 60.7 | No | Mixtral of Experts | 2024-01-08 | Code |
| 48 | DeepSeek-Coder-Base 6.7B (few-shot) | 60.6 | No | DeepSeek-Coder: When the Large Language Model Me... | 2024-01-25 | Code |
| 49 | code-davinci-001 175B + MBR-Exec | 58.2 | No | Natural Language to Code Translation with Execut... | 2022-04-25 | Code |
| 50 | Code Llama - Instruct 34B (3-shot) | 57 | No | Code Llama: Open Foundation Models for Code | 2023-08-24 | Code |
| 51 | Code Llama - Python 34B (3-shot) | 56.2 | Yes | Code Llama: Open Foundation Models for Code | 2023-08-24 | Code |
| 52 | code-cushman-001 12B (CodeT) | 55.4 | No | CodeT: Code Generation with Generated Tests | 2022-07-21 | Code |
| 53 | Code Llama 34B (3-shot) | 55 | Yes | Code Llama: Open Foundation Models for Code | 2023-08-24 | Code |
| 54 | StarCoder 15.5B (Self-Debugging with unit tests + trace) | 53.2 | No | Teaching Large Language Models to Self-Debug | 2023-04-11 | Code |
| 55 | StarCoder 15.5B | 52.7 | No | StarCoder: may the source be with you! | 2023-05-09 | Code |
| 56 | GPT-3.5 Turbo | 52.2 | Yes | Code Llama: Open Foundation Models for Code | 2023-08-24 | Code |
| 57 | WizardCoder 15B | 51.8 | Yes | WizardCoder: Empowering Code Large Language Mode... | 2023-06-14 | Code |
| 58 | PaLM 2-S* (few-shot) | 50 | No | PaLM 2 Technical Report | 2023-05-17 | Code |
| 59 | CodeGen-Mono 16B + CodeT | 49.5 | No | CodeT: Code Generation with Generated Tests | 2022-07-21 | Code |
| 60 | Code Llama - Instruct 13B (3-shot) | 49.4 | No | Code Llama: Open Foundation Models for Code | 2023-08-24 | Code |
| 61 | DeepSeek-Coder-Instruct 1.3B (few-shot) | 49.4 | No | DeepSeek-Coder: When the Large Language Model Me... | 2024-01-25 | Code |
| 62 | StarCoderBase 15.5B | 49 | No | StarCoder: may the source be with you! | 2023-05-09 | Code |
| 63 | Code Llama - Python 13B (3-shot) | 49 | No | Code Llama: Open Foundation Models for Code | 2023-08-24 | Code |
| 64 | Qwen2idae-16x14B (4-shot) | 48.6 | No | Parameter-Efficient Sparsity Crafting from Dense... | 2024-01-05 | Code |
| 65 | code-cushman-001 12B + MBR-Exec | 48.3 | No | Coder Reviewer Reranking for Code Generation | 2022-11-29 | Code |
| 66 | Code Llama - Python 7B (3-shot) | 47.6 | No | Code Llama: Open Foundation Models for Code | 2023-08-24 | Code |
| 67 | Mistral 7B (3-shot) | 47.5 | No | Mistral 7B | 2023-10-10 | Code |
| 68 | CodeGen 16B + MBR-Exec | 47.3 | No | Coder Reviewer Reranking for Code Generation | 2022-11-29 | Code |
| 69 | StarCoder 15.5B (3-shot) | 47.2 | No | Teaching Large Language Models to Self-Debug | 2023-04-11 | Code |
| 70 | PaLM Coder 540B | 47 | No | PaLM: Scaling Language Modeling with Pathways | 2022-04-05 | Code |
| 71 | Code Llama 13B (3-shot) | 47 | No | Code Llama: Open Foundation Models for Code | 2023-08-24 | Code |
| 72 | CodeGen 16B + Coder-Reviewer | 46.2 | No | Coder Reviewer Reranking for Code Generation | 2022-11-29 | Code |
| 73 | DeepSeek-Coder-Base 1.3B (few-shot) | 46.2 | No | DeepSeek-Coder: When the Large Language Model Me... | 2024-01-25 | Code |
| 74 | GPT-3.5 Turbo (few-shot) | 45.4 | No | INTERVENOR: Prompting the Coding Ability of Larg... | 2023-11-16 | Code |
| 75 | Llama 2 70B (zero-shot) | 45 | No | Llama 2: Open Foundation and Fine-Tuned Chat Mod... | 2023-07-18 | Code |
| 76 | Code Llama - Instruct 7B (3-shot) | 44.4 | No | Code Llama: Open Foundation Models for Code | 2023-08-24 | Code |
| 77 | CodeGen 16B + Reviewer | 44.1 | No | Coder Reviewer Reranking for Code Generation | 2022-11-29 | Code |
| 78 | phi-1.5-web 1.3B | 43.5 | No | Textbooks Are All You Need II: phi-1.5 technical... | 2023-09-11 | Code |
| 79 | Branch-Train-Merge 4x7B (top-2) | 42.6 | No | Branch-Train-MiX: Mixing Expert LLMs into a Mixt... | 2024-03-12 | Code |
| 80 | Code Llama 7B (3-shot) | 41.4 | No | Code Llama: Open Foundation Models for Code | 2023-08-24 | Code |
| 81 | Camelidae-8×34B (4-shot) | 41.4 | No | Parameter-Efficient Sparsity Crafting from Dense... | 2024-01-05 | Code |
| 82 | GPT-3.5 Turbo (0-shot) | 39.8 | No | INTERVENOR: Prompting the Coding Ability of Larg... | 2023-11-16 | Code |
| 83 | Branch-Train-MiX 4x7B (sampling top-2 experts) | 39.4 | No | Branch-Train-MiX: Mixing Expert LLMs into a Mixt... | 2024-03-12 | Code |
| 84 | LLaMA 65B (0-shot) | 37.7 | No | LLaMA: Open and Efficient Foundation Language Mo... | 2023-02-27 | Code |
| 85 | PaLM 540B | 36.8 | No | PaLM: Scaling Language Modeling with Pathways | 2022-04-05 | Code |
| 86 | SantaCoder 1.1B | 35 | No | StarCoder: may the source be with you! | 2023-05-09 | Code |
| 87 | InCoder 6.7B + CodeT | 34.4 | No | CodeT: Code Generation with Generated Tests | 2022-07-21 | Code |
| 88 | Llama 2 34B (0-shot) | 33 | No | Llama 2: Open Foundation and Fine-Tuned Chat Mod... | 2023-07-18 | Code |
| 89 | Llama 2 13B (0-shot) | 30.6 | No | Llama 2: Open Foundation and Fine-Tuned Chat Mod... | 2023-07-18 | Code |
| 90 | LLaMA 33B (0-shot) | 30.2 | No | LLaMA: Open and Efficient Foundation Language Mo... | 2023-02-27 | Code |
| 91 | InCoder 6.7B + MBR-Exec | 26.7 | No | Coder Reviewer Reranking for Code Generation | 2022-11-29 | Code |
| 92 | InCoder 6.7B + Coder-Reviewer | 26.1 | No | Coder Reviewer Reranking for Code Generation | 2022-11-29 | Code |
| 93 | InCoder 6.7B + Reviewer | 24.4 | No | Coder Reviewer Reranking for Code Generation | 2022-11-29 | Code |
| 94 | CodeGeeX-13B | 24.4 | No | CodeGeeX: A Pre-Trained Model for Code Generatio... | 2023-03-30 | Code |
| 95 | LLaMA 13B (0-shot) | 22 | No | LLaMA: Open and Efficient Foundation Language Mo... | 2023-02-27 | Code |
| 96 | Llama 2 7B (0-shot) | 20.8 | No | Llama 2: Open Foundation and Fine-Tuned Chat Mod... | 2023-07-18 | Code |
| 97 | InCoder 6.7B (0-shot) | 19.4 | No | InCoder: A Generative Model for Code Infilling a... | 2022-04-12 | Code |
| 98 | LLaMA 7B (0-shot) | 17.7 | No | LLaMA: Open and Efficient Foundation Language Mo... | 2023-02-27 | Code |