Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Arithmetic Reasoning on GSM8K

Metric: Accuracy (higher is better)
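
As a rough illustration of the metric, the sketch below scores a list of model outputs against reference answers by comparing the final number in each string. This is only one plausible scoring rule, not necessarily the one used by any paper on this leaderboard: the helper names `final_number` and `accuracy` are hypothetical, and taking the last number in the text as the answer is an assumption (GSM8K reference solutions end with a `#### <answer>` line, which this rule happens to handle).

```python
import re
from typing import Optional, Sequence

# Integers or decimals, optionally negative, with optional thousands separators.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def final_number(text: str) -> Optional[float]:
    """Return the last number found in `text`, or None if there is none."""
    matches = _NUMBER.findall(text)
    if not matches:
        return None
    # Drop thousands separators (and any trailing comma) before converting.
    return float(matches[-1].replace(",", ""))


def accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Percentage of predictions whose final number equals the reference's."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("need at least one example")
    correct = 0
    for pred, ref in zip(predictions, references):
        p, r = final_number(pred), final_number(ref)
        # A prediction with no number at all counts as wrong.
        if p is not None and r is not None and p == r:
            correct += 1
    return 100.0 * correct / len(references)


if __name__ == "__main__":
    preds = [
        "She earns 9 * 2 = 18 dollars. #### 18",
        "The total is 1,200.",
        "I am not sure.",
        "So the answer is 7",
    ]
    refs = ["#### 18", "#### 1200", "#### 3", "#### 8"]
    print(accuracy(preds, refs))  # 50.0
```

Comparing parsed floats rather than raw strings means `18` and `18.0` count as the same answer; a stricter evaluation might instead require an exact string match.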

Results

| # | Model | Accuracy | Extra Data | Paper | Date | Code |
|---|---|---|---|---|---|---|
| 1 | Claude 3.5 Sonnet (HPT) | 97.72 | No | Hierarchical Prompting Taxonomy: A Universal Eva... | 2024-06-18 | Code |
| 2 | DUP prompt upon GPT-4 | 97.1 | No | Achieving >97% on GSM8K: Deeply Understanding th... | 2024-04-23 | Code |
| 3 | Qwen2-Math-72B-Instruct (greedy) | 96.7 | Yes | Qwen2 Technical Report | 2024-07-15 | Code |
| 4 | SFT-Mistral-7B (Metamath, OVM, Smart Ensemble) | 96.4 | Yes | - | - | - |
| 5 | OpenMath2-Llama3.1-70B (majority@256) | 96 | Yes | OpenMathInstruct-2: Accelerating AI for Math wit... | 2024-10-02 | Code |
| 6 | Jiutian-大模型 | 95.2 | No | - | - | - |
| 7 | DAMOMath-7B(MetaMath, OVM, BS, Ensemble) | 95.1 | Yes | - | - | - |
| 8 | Claude 3 Opus (0-shot chain-of-thought) | 95 | No | - | - | - |
| 9 | OpenMath2-Llama3.1-70B | 94.9 | Yes | OpenMathInstruct-2: Accelerating AI for Math wit... | 2024-10-02 | Code |
| 10 | GPT-4 (Teaching-Inspired) | 94.8 | No | Teaching-Inspired Integrated Prompting Framework... | 2024-10-10 | Code |
| 11 | SFT-Mistral-7B (Metamath + ovm +ensemble) | 94.13 | Yes | - | - | - |
| 12 | OpenMath2-Llama3.1-8B (majority@256) | 94.1 | Yes | OpenMathInstruct-2: Accelerating AI for Math wit... | 2024-10-02 | Code |
| 13 | Qwen2-72B-Instruct-Step-DPO (0-shot CoT) | 94 | Yes | Step-DPO: Step-wise Preference Optimization for ... | 2024-06-26 | Code |
| 14 | DAMOMath-7B(MetaMath, OVM, Ensemble) | 93.2 | Yes | - | - | - |
| 15 | Claude 3 Sonnet (0-shot chain-of-thought) | 92.3 | No | - | - | - |
| 16 | AlphaLLM (with MCTS) | 92 | No | Toward Self-Improvement of LLMs via Imagination,... | 2024-04-18 | Code |
| 17 | OpenMath2-Llama3.1-8B | 91.7 | Yes | OpenMathInstruct-2: Accelerating AI for Math wit... | 2024-10-02 | Code |
| 18 | PaLM 2 (few-shot, k=8, SC) | 91 | No | PaLM 2 Technical Report | 2023-05-17 | Code |
| 19 | GaC(Qwen2-72B-Instruct + Llama-3-70B-Instruct) | 90.91 | No | Breaking the Ceiling of the LLM Community by Tre... | 2024-06-18 | Code |
| 20 | OpenMath-CodeLlama-70B (w/ code, SC, k=50) | 90.8 | Yes | OpenMathInstruct-1: A 1.8 Million Math Instructi... | 2024-02-15 | Code |
| 21 | DART-Math-Llama3-70B-Uniform (0-shot CoT, w/o code) | 90.4 | Yes | DART-Math: Difficulty-Aware Rejection Tuning for... | 2024-06-18 | Code |
| 22 | OpenMath-Llama2-70B (w/ code, SC, k=50) | 90.1 | Yes | OpenMathInstruct-1: A 1.8 Million Math Instructi... | 2024-02-15 | Code |
| 23 | DART-Math-Llama3-70B-Prop2Diff (0-shot CoT, w/o code) | 89.6 | Yes | DART-Math: Difficulty-Aware Rejection Tuning for... | 2024-06-18 | Code |
| 24 | Shepherd+Mistral-7B (SFT on MetaMATH + PRM RL+ PRM rerank, k=256) | 89.1 | Yes | Math-Shepherd: Verify and Reinforce LLMs Step-by... | 2023-12-14 | Code |
| 25 | Llama SFT (Metamath ToRA Ensemble) | 89 | Yes | - | - | - |
| 26 | Minerva 62B (maj5@100) | 89 | No | Solving Quantitative Reasoning Problems with Lan... | 2022-06-29 | Code |
| 27 | Claude 3 Haiku (0-shot chain-of-thought) | 88.9 | No | - | - | - |
| 28 | ToRA-70B (SC, k=50) | 88.3 | Yes | ToRA: A Tool-Integrated Reasoning Agent for Math... | 2023-09-29 | Code |
| 29 | DeepSeekMATH-RL-7B | 88.2 | Yes | DeepSeekMath: Pushing the Limits of Mathematical... | 2024-02-05 | Code |
| 30 | DART-Math-DSMath-7B-Uniform (0-shot CoT, w/o code) | 88.2 | Yes | DART-Math: Difficulty-Aware Rejection Tuning for... | 2024-06-18 | Code |
| 31 | OpenMath-CodeLlama-34B (w/ code, SC, k=50) | 88 | Yes | OpenMathInstruct-1: A 1.8 Million Math Instructi... | 2024-02-15 | Code |
| 32 | Claude 2 (0-shot chain-of-thought) | 88 | No | - | - | - |
| 33 | Shivaay-4B (8-shot chain-of-thought) | 87.41 | No | - | - | - |
| 34 | DeepMind 70B Model (SFT+ORM-RL, ORM reranking) | 87.3 | Yes | Solving math word problems with process- and out... | 2022-11-25 | - |
| 35 | MMOS-DeepSeekMath-7B(0-shot,k=50) | 87.2 | Yes | An Empirical Study of Data Ability Boundary in L... | 2024-02-23 | Code |
| 36 | DeepMind 70B Model (SFT+PRM-RL, PRM reranking) | 87.1 | Yes | Solving math word problems with process- and out... | 2022-11-25 | - |
| 37 | GPT-4 | 87.1 | No | Sparks of Artificial General Intelligence: Early... | 2023-03-22 | Code |
| 38 | OpenMath-Mistral-7B (w/ code, SC, k=50) | 86.9 | Yes | OpenMathInstruct-1: A 1.8 Million Math Instructi... | 2024-02-15 | Code |
| 39 | Orca-Math 7B (fine-tuned) | 86.8 | Yes | Orca-Math: Unlocking the potential of SLMs in Gr... | 2024-02-16 | - |
| 40 | DART-Math-DSMath-7B-Prop2Diff (0-shot CoT, w/o code) | 86.8 | Yes | DART-Math: Difficulty-Aware Rejection Tuning for... | 2024-06-18 | Code |
| 41 | OpenMath-CodeLlama-13B (w/ code, SC, k=50) | 86.8 | Yes | OpenMathInstruct-1: A 1.8 Million Math Instructi... | 2024-02-15 | Code |
| 42 | Gemini Pro (maj1@32) | 86.5 | No | Gemini: A Family of Highly Capable Multimodal Mo... | 2023-12-19 | Code |
| 43 | Codex (Self-Evaluation Guided Decoding, PAL, multiple reasoning chains, 9-shot gen, 5-shot eval) | 85.5 | No | - | - | - |
| 44 | Claude 1.3 (0-shot chain-of-thought) | 85.2 | No | - | - | - |
| 45 | ToRA-Code-34B (SC, k=50) | 85.1 | Yes | ToRA: A Tool-Integrated Reasoning Agent for Math... | 2023-09-29 | Code |
| 46 | OpenMath-CodeLlama-7B (w/ code, SC, k=50) | 84.8 | Yes | OpenMathInstruct-1: A 1.8 Million Math Instructi... | 2024-02-15 | Code |
| 47 | OVM-Mistral-7B (verify100@1) | 84.7 | No | OVM, Outcome-supervised Value Models for Plannin... | 2023-11-16 | Code |
| 48 | OpenMath-Llama2-70B (w/ code) | 84.7 | Yes | OpenMathInstruct-1: A 1.8 Million Math Instructi... | 2024-02-15 | Code |
| 49 | OpenMath-CodeLlama-70B (w/ code) | 84.6 | Yes | OpenMathInstruct-1: A 1.8 Million Math Instructi... | 2024-02-15 | Code |
| 50 | code-davinci-002 175B (LEVER, 8-shot) | 84.5 | No | LEVER: Learning to Verify Language-to-Code Gener... | 2023-02-16 | Code |
| 51 | ToRA 70B | 84.3 | Yes | ToRA: A Tool-Integrated Reasoning Agent for Math... | 2023-09-29 | Code |
| 52 | Shepherd + Mistral-7B (SFT on MetaMATH + PRM RL) | 84.1 | Yes | Math-Shepherd: Verify and Reinforce LLMs Step-by... | 2023-12-14 | Code |
| 53 | MathCoder-L-70B | 83.9 | Yes | MathCoder: Seamless Code Integration in LLMs for... | 2023-10-05 | Code |
| 54 | WizardMath-7B-V1.1 | 83.2 | Yes | WizardMath: Empowering Mathematical Reasoning fo... | 2023-08-18 | Code |
| 55 | DIVERSE 175B (8-shot) | 83.2 | No | Making Large Language Models Better Reasoners wi... | 2022-06-06 | - |
| 56 | OVM-Mistral-7B (verify20@1) | 82.6 | No | OVM, Outcome-supervised Value Models for Plannin... | 2023-11-16 | Code |
| 57 | DART-Math-Mistral-7B-Uniform (0-shot CoT, w/o code) | 82.6 | Yes | DART-Math: Difficulty-Aware Rejection Tuning for... | 2024-06-18 | Code |
| 58 | ChatGPT (Ask, Refine, Trust) | 82.6 | No | The ART of LLM Refinement: Ask, Refine, and Trust | 2023-11-14 | - |
| 59 | DART-Math-Llama3-8B-Uniform (0-shot CoT, w/o code) | 82.5 | Yes | DART-Math: Difficulty-Aware Rejection Tuning for... | 2024-06-18 | Code |
| 60 | MetaMath 70B | 82.3 | Yes | MetaMath: Bootstrap Your Own Mathematical Questi... | 2023-09-21 | Code |
| 61 | MuggleMATH 70B | 82.3 | Yes | MuggleMath: Assessing the Impact of Query and Re... | 2023-10-09 | Code |
| 62 | PaLM 540B (Self Improvement, Self Consistency) | 82.1 | No | Large Language Models Can Self-Improve | 2022-10-20 | - |
| 63 | MathCoder-CL-34B | 81.7 | Yes | MathCoder: Seamless Code Integration in LLMs for... | 2023-10-05 | Code |
| 64 | WizardMath-70B-V1.0 | 81.6 | Yes | WizardMath: Empowering Mathematical Reasoning fo... | 2023-08-18 | Code |
| 65 | Phi-GSM+V 1.3B+1.3B (verify48@1) | 81.5 | No | TinyGSM: achieving >80% on GSM8k with small lang... | 2023-12-14 | - |
| 66 | DART-Math-Mistral-7B-Prop2Diff (0-shot CoT, w/o code) | 81.1 | Yes | DART-Math: Difficulty-Aware Rejection Tuning for... | 2024-06-18 | Code |
| 67 | DART-Math-Llama3-8B-Prop2Diff (0-shot CoT, w/o code) | 81.1 | Yes | DART-Math: Difficulty-Aware Rejection Tuning for... | 2024-06-18 | Code |
| 68 | Claude Instant 1.1 (0-shot chain-of-thought) | 80.9 | No | - | - | - |
| 69 | ToRA-Code 34B | 80.7 | Yes | ToRA: A Tool-Integrated Reasoning Agent for Math... | 2023-09-29 | Code |
| 70 | OpenMath-CodeLlama-34B (w/ code) | 80.7 | Yes | OpenMathInstruct-1: A 1.8 Million Math Instructi... | 2024-02-15 | Code |
| 71 | PaLM 2 (few-shot, k=8, CoT) | 80.7 | No | PaLM 2 Technical Report | 2023-05-17 | Code |
| 72 | MMOS-DeepSeekMath-7B(0-shot) | 80.5 | Yes | An Empirical Study of Data Ability Boundary in L... | 2024-02-23 | Code |
| 73 | MMOS-CODE-34B(0-shot) | 80.4 | Yes | An Empirical Study of Data Ability Boundary in L... | 2024-02-23 | Code |
| 74 | OpenMath-Mistral-7B (w/ code) | 80.2 | Yes | OpenMathInstruct-1: A 1.8 Million Math Instructi... | 2024-02-15 | Code |
| 75 | Self-Evaluation Guided Decoding (Codex, PAL, single reasoning chain, 9-shot gen, 5-shot eval) | 80.2 | No | - | - | - |
| 76 | OpenMath-CodeLlama-13B (w/ code) | 78.8 | Yes | OpenMathInstruct-1: A 1.8 Million Math Instructi... | 2024-02-15 | Code |
| 77 | Minerva 540B (CoT) | 78.5 | No | Solving Quantitative Reasoning Problems with Lan... | 2022-06-29 | Code |
| 78 | Camelidae-8×34B (5-shot) | 78.3 | No | Parameter-Efficient Sparsity Crafting from Dense... | 2024-01-05 | Code |
| 79 | Qwen2idae-16x14B (5-shot) | 77.8 | No | Parameter-Efficient Sparsity Crafting from Dense... | 2024-01-05 | Code |
| 80 | MetaMath-Mistral-7B | 77.7 | Yes | MetaMath: Bootstrap Your Own Mathematical Questi... | 2023-09-21 | Code |
| 81 | OpenChat-3.5 7B | 77.3 | No | OpenChat: Advancing Open-source Language Models ... | 2023-09-20 | Code |
| 82 | DeepMind 70B Model (STaR, maj1@96) | 76.5 | Yes | Solving math word problems with process- and out... | 2022-11-25 | - |
| 83 | Arithmo2-Mistral-7B | 76.4 | No | - | - | - |
| 84 | OpenMath-CodeLlama-7B (w/ code) | 75.9 | Yes | OpenMathInstruct-1: A 1.8 Million Math Instructi... | 2024-02-15 | Code |
| 85 | ToRA-Code 13B | 75.8 | Yes | ToRA: A Tool-Integrated Reasoning Agent for Math... | 2023-09-29 | Code |
| 86 | Arithmo-Mistral-7B | 74.7 | No | - | - | - |
| 87 | PaLM 540B maj1@40 (8-shot) | 74.4 | Yes | Self-Consistency Improves Chain of Thought Reaso... | 2022-03-21 | Code |
| 88 | PaLM 540B (Self Consistency) | 74.4 | No | Large Language Models Can Self-Improve | 2022-10-20 | - |
| 89 | Phi-GSM 2.7B (fine-tuned) | 74.3 | No | TinyGSM: achieving >80% on GSM8k with small lang... | 2023-12-14 | - |
| 90 | MathCoder-CL-13B | 74.1 | Yes | MathCoder: Seamless Code Integration in LLMs for... | 2023-10-05 | Code |
| 91 | MuggleMATH 13B | 74 | Yes | MuggleMath: Assessing the Impact of Query and Re... | 2023-10-09 | Code |
| 92 | MMOS-CODE-7B(0-shot) | 73.9 | Yes | An Empirical Study of Data Ability Boundary in L... | 2024-02-23 | Code |
| 93 | CodeT5+ | 73.8 | No | CodeT5+: Open Code Large Language Models for Cod... | 2023-05-13 | Code |
| 94 | Llama-3.3-70B + CAPO | 73.73 | No | CAPO: Cost-Aware Prompt Optimization | 2025-04-22 | Code |
| 95 | OVM-Llama2-7B (verify100@1) | 73.7 | No | OVM, Outcome-supervised Value Models for Plannin... | 2023-11-16 | Code |
| 96 | PaLM 540B (Self Improvement, CoT Prompting) | 73.5 | No | Large Language Models Can Self-Improve | 2022-10-20 | - |
| 97 | KwaiYiiMath 13B | 73.3 | Yes | KwaiYiiMath: Technical Report | 2023-10-11 | - |
| 98 | ToRA-Code 7B | 72.6 | Yes | ToRA: A Tool-Integrated Reasoning Agent for Math... | 2023-09-29 | Code |
| 99 | MathCoder-L-13B | 72.6 | Yes | MathCoder: Seamless Code Integration in LLMs for... | 2023-10-05 | Code |
| 100 | DBRX Base 132B | 72.3 | No | - | - | - |
| 101 | Self-Evaluation Guided Decoding (Codex, CoT, single reasoning chain, 9-shot gen, 5-shot eval) | 71.9 | No | - | - | - |
| 102 | MetaMath 13B | 71 | Yes | MetaMath: Bootstrap Your Own Mathematical Questi... | 2023-09-21 | Code |
| 103 | MuggleMATH 7B | 69.8 | Yes | MuggleMath: Assessing the Impact of Query and Re... | 2023-10-09 | Code |
| 104 | LLaMA 65B-maj1@k | 69.7 | No | LLaMA: Open and Efficient Foundation Language Mo... | 2023-02-27 | Code |
| 105 | Minerva 62B (maj1@100) | 68.5 | Yes | Solving Quantitative Reasoning Problems with Lan... | 2022-06-29 | Code |
| 106 | code-davinci-002 (Least-to-Most Prompting) | 68.01 | No | Least-to-Most Prompting Enables Complex Reasonin... | 2022-05-21 | Code |
| 107 | MathCoder-CL-7B | 67.8 | Yes | MathCoder: Seamless Code Integration in LLMs for... | 2023-10-05 | Code |
| 108 | DBRX Instruct 132B | 66.9 | No | - | - | - |
| 109 | MetaMath 7B | 66.4 | Yes | MetaMath: Bootstrap Your Own Mathematical Questi... | 2023-09-21 | Code |
| 110 | Mistral-Small-24B + CAPO | 65.07 | No | CAPO: Cost-Aware Prompt Optimization | 2025-04-22 | Code |
| 111 | RFT 70B | 64.8 | Yes | Scaling Relationship on Learning Mathematical Re... | 2023-08-03 | Code |
| 112 | MathCoder-L-7B | 64.2 | Yes | MathCoder: Seamless Code Integration in LLMs for... | 2023-10-05 | Code |
| 113 | WizardMath-13B-V1.0 | 63.9 | Yes | WizardMath: Empowering Mathematical Reasoning fo... | 2023-08-18 | Code |
| 114 | GPT-J (CoRe) | 63.2 | No | Solving Math Word Problems via Cooperative Reaso... | 2022-10-28 | Code |
| 115 | Llama-2 70B (on 100 first questions, 4-shot, auto-optimized prompting) | 61 | No | The Unreasonable Effectiveness of Eccentric Auto... | 2024-02-09 | - |
| 116 | Qwen2.5-32B + CAPO | 60.2 | No | CAPO: Cost-Aware Prompt Optimization | 2025-04-22 | Code |
| 117 | LLaMA 2 70B (CoT-Influx) | 59.59 | No | Fewer is More: Boosting LLM Reasoning with Reinf... | 2023-12-14 | - |
| 118 | Orca 2 13B | 59.14 | No | Orca 2: Teaching Small Language Models How to Re... | 2023-11-18 | - |
| 119 | U-PaLM | 58.5 | No | Transcending Scaling Laws with 0.1% Extra Compute | 2022-10-20 | - |
| 120 | PaLM-540B (few-Shot-cot) | 58.1 | Yes | Large Language Models are Zero-Shot Reasoners | 2022-05-24 | Code |
| 121 | GPT-3.5 (few-shot, k=5) | 57.1 | No | GPT-4 Technical Report | 2023-03-15 | Code |
| 122 | Minerva 8B (maj5@100) | 56.8 | No | Solving Quantitative Reasoning Problems with Lan... | 2022-06-29 | Code |
| 123 | LLaMA 2 70B (one-shot) | 56.8 | No | Llama 2: Open Foundation and Fine-Tuned Chat Mod... | 2023-07-18 | Code |
| 124 | PaLM 540B (8-shot) | 56.5 | Yes | Solving Quantitative Reasoning Problems with Lan... | 2022-06-29 | Code |
| 125 | PaLM 540B (CoT Prompting) | 56.5 | No | Large Language Models Can Self-Improve | 2022-10-20 | - |
| 126 | RFT 13B | 55.3 | Yes | Scaling Relationship on Learning Mathematical Re... | 2023-08-03 | Code |
| 127 | Finetuned GPT-3 175B + verifier | 55 | Yes | Large Language Models are Zero-Shot Reasoners | 2022-05-24 | Code |
| 128 | WizardMath-7B-V1.0 | 54.9 | Yes | WizardMath: Empowering Mathematical Reasoning fo... | 2023-08-18 | Code |
| 129 | LLaMA 33B-maj1@k | 53.1 | No | LLaMA: Open and Efficient Foundation Language Mo... | 2023-02-27 | Code |
| 130 | Minerva 62B (8-shot) | 52.4 | Yes | Solving Quantitative Reasoning Problems with Lan... | 2022-06-29 | Code |
| 131 | Mistral 7B (maj@8) | 52.2 | No | Mistral 7B | 2023-10-10 | Code |
| 132 | Llemma 34B | 51.5 | No | Llemma: An Open Language Model For Mathematics | 2023-10-16 | Code |
| 133 | Text-davinci-002-175B (zero-plus-few-Shot-cot (8 samples)) | 51.5 | Yes | Large Language Models are Zero-Shot Reasoners | 2022-05-24 | Code |
| 134 | RFT 7B | 51.2 | Yes | Scaling Relationship on Learning Mathematical Re... | 2023-08-03 | Code |
| 135 | LLaMA 65B | 50.9 | No | LLaMA: Open and Efficient Foundation Language Mo... | 2023-02-27 | Code |
| 136 | Orca 2 7B | 47.23 | No | Orca 2: Teaching Small Language Models How to Re... | 2023-11-18 | - |
| 137 | Llama-2 13B (on 100 first questions, 4-shot, auto-optimized prompting) | 43 | No | The Unreasonable Effectiveness of Eccentric Auto... | 2024-02-09 | - |
| 138 | text-davinci-002 175B (2-shot, CoT) | 41.3 | Yes | Large Language Models are Zero-Shot Reasoners | 2022-05-24 | Code |
| 139 | Mistral 7B (on 100 first questions, 4-shot, auto-optimized prompting) | 41 | No | The Unreasonable Effectiveness of Eccentric Auto... | 2024-02-09 | - |
| 140 | text-davinci-002 175B (0-shot, CoT) | 40.7 | Yes | Large Language Models are Zero-Shot Reasoners | 2022-05-24 | Code |
| 141 | Branch-Train-MiX 4x7B (sampling top-2 experts) | 37.1 | No | Branch-Train-MiX: Mixing Expert LLMs into a Mixt... | 2024-03-12 | Code |
| 142 | Llemma 7B | 36.4 | No | Llemma: An Open Language Model For Mathematics | 2023-10-16 | Code |
| 143 | LLaMA 33B | 35.6 | No | LLaMA: Open and Efficient Foundation Language Mo... | 2023-02-27 | Code |
| 144 | Vicuna (SYRELM) | 35.2 | Yes | Frugal LMs Trained to Invoke Symbolic Solvers Ac... | 2023-12-09 | Code |
| 145 | PaLM 62B (8-shot) | 33 | Yes | Solving Quantitative Reasoning Problems with Lan... | 2022-06-29 | Code |
| 146 | PaLM 540B (Self Improvement, Standard-Prompting) | 32.2 | No | Large Language Models Can Self-Improve | 2022-10-20 | - |
| 147 | LLaMA 13B-maj1@k | 29.3 | No | LLaMA: Open and Efficient Foundation Language Mo... | 2023-02-27 | Code |
| 148 | Minerva 8B-maj1@k (8-shot) | 28.4 | Yes | Solving Quantitative Reasoning Problems with Lan... | 2022-06-29 | Code |
| 149 | GPT-2-Medium 355M + question-solution classifier (BS=5) | 20.8 | No | Composing Ensembles of Pre-trained Models via It... | 2022-10-20 | - |
| 150 | GPT-Neo-2.7B + Self-Sampling | 19.5 | No | Learning Math Reasoning from Self-Sampled Correc... | 2022-05-28 | Code |
| 151 | GPT-2-Medium 355M (fine-tuned, BS=5) | 18.3 | No | Composing Ensembles of Pre-trained Models via It... | 2022-10-20 | - |
| 152 | LLaMA 7B (maj1@k) | 18.1 | No | LLaMA: Open and Efficient Foundation Language Mo... | 2023-02-27 | Code |
| 153 | PaLM 540B (few-shot) | 17.9 | Yes | Large Language Models are Zero-Shot Reasoners | 2022-05-24 | Code |
| 154 | PaLM 540B (Standard-Prompting) | 17.9 | No | Large Language Models Can Self-Improve | 2022-10-20 | - |
| 155 | LLaMA 13B | 17.8 | No | LLaMA: Open and Efficient Foundation Language Mo... | 2023-02-27 | Code |
| 156 | GPT-2-Medium 355M + question-solution classifier (BS=1) | 16.8 | No | Composing Ensembles of Pre-trained Models via It... | 2022-10-20 | - |
| 157 | Minerva 8B (8-shot) | 16.2 | Yes | Solving Quantitative Reasoning Problems with Lan... | 2022-06-29 | Code |
| 158 | GPT-2-Medium 355M (BS=5) | 12.2 | No | Composing Ensembles of Pre-trained Models via It... | 2022-10-20 | - |
| 159 | LLaMA 7B | 11 | No | LLaMA: Open and Efficient Foundation Language Mo... | 2023-02-27 | Code |
| 160 | Text-davinci-002-175B (0-shot) | 10.4 | Yes | Large Language Models are Zero-Shot Reasoners | 2022-05-24 | Code |
| 161 | GPT-Neo 125M + Self-Sampling | 7.5 | No | Learning Math Reasoning from Self-Sampled Correc... | 2022-05-28 | Code |
| 162 | UL2 20B (chain-of-thought) | 4.4 | No | UL2: Unifying Language Learning Paradigms | 2022-05-10 | Code |
| 163 | PaLM 8B (8-shot) | 4.1 | Yes | Solving Quantitative Reasoning Problems with Lan... | 2022-06-29 | Code |
| 164 | UL2 20B (0-shot) | 4.1 | No | UL2: Unifying Language Learning Paradigms | 2022-05-10 | Code |
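
Several entries in the results carry labels such as `majority@256`, `maj1@k`, or `SC, k=50`. These labels usually indicate that k solutions were sampled per question and the most frequent final answer was submitted, though the exact procedure differs between papers. The sketch below shows that general idea only; the function name `majority_vote` and the convention that unparseable samples are passed as `None` are assumptions made for illustration.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(answers: Sequence[Optional[str]]) -> Optional[str]:
    """Return the most common non-None answer among the sampled answers.

    Ties go to the answer that appeared first, because Counter.most_common
    keeps equally frequent items in first-encountered order.
    """
    counts = Counter(a for a in answers if a is not None)
    if not counts:
        return None  # no sample produced a usable answer
    return counts.most_common(1)[0][0]


if __name__ == "__main__":
    # Five hypothetical sampled final answers for one question.
    samples = ["18", "20", "18", None, "18"]
    print(majority_vote(samples))  # 18
```

Because voting spends k times the inference budget of a single greedy decode, scores reported with and without it are not directly comparable.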