Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Arithmetic Reasoning on GSM8K

Metric: Parameters (Billion) (higher is better)

Results

| # | Model | Parameters (Billion) | Extra Data | Paper | Date | Code |
|---|---|---|---|---|---|---|
| 1 | PaLM 540B (Self Improvement, Self Consistency) | 540 | No | Large Language Models Can Self-Improve | 2022-10-20 | - |
| 2 | Minerva 540B (CoT) | 540 | No | Solving Quantitative Reasoning Problems with Lan... | 2022-06-29 | Code |
| 3 | PaLM 540B maj1@40 (8-shot) | 540 | Yes | Self-Consistency Improves Chain of Thought Reaso... | 2022-03-21 | Code |
| 4 | PaLM 540B (Self Consistency) | 540 | No | Large Language Models Can Self-Improve | 2022-10-20 | - |
| 5 | PaLM 540B (Self Improvement, CoT Prompting) | 540 | No | Large Language Models Can Self-Improve | 2022-10-20 | - |
| 6 | U-PaLM | 540 | No | Transcending Scaling Laws with 0.1% Extra Compute | 2022-10-20 | - |
| 7 | PaLM-540B (few-Shot-cot) | 540 | Yes | Large Language Models are Zero-Shot Reasoners | 2022-05-24 | Code |
| 8 | PaLM 540B (8-shot) | 540 | Yes | Solving Quantitative Reasoning Problems with Lan... | 2022-06-29 | Code |
| 9 | PaLM 540B (CoT Prompting) | 540 | No | Large Language Models Can Self-Improve | 2022-10-20 | - |
| 10 | PaLM 540B (Self Improvement, Standard-Prompting) | 540 | No | Large Language Models Can Self-Improve | 2022-10-20 | - |
| 11 | PaLM 540B (few-shot) | 540 | Yes | Large Language Models are Zero-Shot Reasoners | 2022-05-24 | Code |
| 12 | PaLM 540B (Standard-Prompting) | 540 | No | Large Language Models Can Self-Improve | 2022-10-20 | - |
| 13 | code-davinci-002 175B (LEVER, 8-shot) | 175 | No | LEVER: Learning to Verify Language-to-Code Gener... | 2023-02-16 | Code |
| 14 | DIVERSE 175B (8-shot) | 175 | No | Making Large Language Models Better Reasoners wi... | 2022-06-06 | - |
| 15 | code-davinci-002 (Least-to-Most Prompting) | 175 | No | Least-to-Most Prompting Enables Complex Reasonin... | 2022-05-21 | Code |
| 16 | Finetuned GPT-3 175B + verifier | 175 | Yes | Large Language Models are Zero-Shot Reasoners | 2022-05-24 | Code |
| 17 | Text-davinci-002-175B (zero-plus-few-Shot-cot (8 samples)) | 175 | Yes | Large Language Models are Zero-Shot Reasoners | 2022-05-24 | Code |
| 18 | text-davinci-002 175B (2-shot, CoT) | 175 | Yes | Large Language Models are Zero-Shot Reasoners | 2022-05-24 | Code |
| 19 | text-davinci-002 175B (0-shot, CoT) | 175 | Yes | Large Language Models are Zero-Shot Reasoners | 2022-05-24 | Code |
| 20 | Text-davinci-002-175B (0-shot) | 175 | Yes | Large Language Models are Zero-Shot Reasoners | 2022-05-24 | Code |
| 21 | RFT 70B | 79 | Yes | Scaling Relationship on Learning Mathematical Re... | 2023-08-03 | Code |
| 22 | Jiutian-大模型 | 75 | No | - | - | - |
| 23 | Qwen2-Math-72B-Instruct (greedy) | 72 | Yes | Qwen2 Technical Report | 2024-07-15 | Code |
| 24 | AlphaLLM (with MCTS) | 70 | No | Toward Self-Improvement of LLMs via Imagination,... | 2024-04-18 | Code |
| 25 | OpenMath-CodeLlama-70B (w/ code, SC, k=50) | 70 | Yes | OpenMathInstruct-1: A 1.8 Million Math Instructi... | 2024-02-15 | Code |
| 26 | DART-Math-Llama3-70B-Uniform (0-shot CoT, w/o code) | 70 | Yes | DART-Math: Difficulty-Aware Rejection Tuning for... | 2024-06-18 | Code |
| 27 | OpenMath-Llama2-70B (w/ code, SC, k=50) | 70 | Yes | OpenMathInstruct-1: A 1.8 Million Math Instructi... | 2024-02-15 | Code |
| 28 | DART-Math-Llama3-70B-Prop2Diff (0-shot CoT, w/o code) | 70 | Yes | DART-Math: Difficulty-Aware Rejection Tuning for... | 2024-06-18 | Code |
| 29 | ToRA-70B (SC, k=50) | 70 | Yes | ToRA: A Tool-Integrated Reasoning Agent for Math... | 2023-09-29 | Code |
| 30 | DeepMind 70B Model (SFT+ORM-RL, ORM reranking) | 70 | Yes | Solving math word problems with process- and out... | 2022-11-25 | - |
| 31 | DeepMind 70B Model (SFT+PRM-RL, PRM reranking) | 70 | Yes | Solving math word problems with process- and out... | 2022-11-25 | - |
| 32 | OpenMath-Llama2-70B (w/ code) | 70 | Yes | OpenMathInstruct-1: A 1.8 Million Math Instructi... | 2024-02-15 | Code |
| 33 | OpenMath-CodeLlama-70B (w/ code) | 70 | Yes | OpenMathInstruct-1: A 1.8 Million Math Instructi... | 2024-02-15 | Code |
| 34 | ToRA 70B | 70 | Yes | ToRA: A Tool-Integrated Reasoning Agent for Math... | 2023-09-29 | Code |
| 35 | MathCoder-L-70B | 70 | Yes | MathCoder: Seamless Code Integration in LLMs for... | 2023-10-05 | Code |
| 36 | MetaMath 70B | 70 | Yes | MetaMath: Bootstrap Your Own Mathematical Questi... | 2023-09-21 | Code |
| 37 | MuggleMATH 70B | 70 | Yes | MuggleMath: Assessing the Impact of Query and Re... | 2023-10-09 | Code |
| 38 | WizardMath-70B-V1.0 | 70 | Yes | WizardMath: Empowering Mathematical Reasoning fo... | 2023-08-18 | Code |
| 39 | DeepMind 70B Model (STaR, maj1@96) | 70 | Yes | Solving math word problems with process- and out... | 2022-11-25 | - |
| 40 | Llama-2 70B (on 100 first questions, 4-shot, auto-optimized prompting) | 70 | No | The Unreasonable Effectiveness of Eccentric Auto... | 2024-02-09 | - |
| 41 | LLaMA 2 70B (CoT-Influx) | 70 | No | Fewer is More: Boosting LLM Reasoning with Reinf... | 2023-12-14 | - |
| 42 | LLaMA 2 70B (on-shot) | 70 | No | Llama 2: Open Foundation and Fine-Tuned Chat Mod... | 2023-07-18 | Code |
| 43 | LLaMA 65B-maj1@k | 65 | No | LLaMA: Open and Efficient Foundation Language Mo... | 2023-02-27 | Code |
| 44 | LLaMA 65B | 65 | No | LLaMA: Open and Efficient Foundation Language Mo... | 2023-02-27 | Code |
| 45 | Minerva 62B (maj5@100) | 62 | No | Solving Quantitative Reasoning Problems with Lan... | 2022-06-29 | Code |
| 46 | Minerva 62B (maj1@100) | 62 | Yes | Solving Quantitative Reasoning Problems with Lan... | 2022-06-29 | Code |
| 47 | Minerva 62B (8-shot) | 62 | Yes | Solving Quantitative Reasoning Problems with Lan... | 2022-06-29 | Code |
| 48 | PaLM 62B (8-shot) | 62 | Yes | Solving Quantitative Reasoning Problems with Lan... | 2022-06-29 | Code |
| 49 | OpenMath-CodeLlama-34B (w/ code, SC, k=50) | 34 | Yes | OpenMathInstruct-1: A 1.8 Million Math Instructi... | 2024-02-15 | Code |
| 50 | ToRA-Code-34B (SC, k=50) | 34 | Yes | ToRA: A Tool-Integrated Reasoning Agent for Math... | 2023-09-29 | Code |
| 51 | MathCoder-CL-34B | 34 | Yes | MathCoder: Seamless Code Integration in LLMs for... | 2023-10-05 | Code |
| 52 | ToRA-Code 34B | 34 | Yes | ToRA: A Tool-Integrated Reasoning Agent for Math... | 2023-09-29 | Code |
| 53 | OpenMath-CodeLlama-34B (w/ code) | 34 | Yes | OpenMathInstruct-1: A 1.8 Million Math Instructi... | 2024-02-15 | Code |
| 54 | MMOS-CODE-34B(0-shot) | 34 | Yes | An Empirical Study of Data Ability Boundary in L... | 2024-02-23 | Code |
| 55 | Llemma 34B | 34 | No | Llemma: An Open Language Model For Mathematics | 2023-10-16 | Code |
| 56 | LLaMA 33B-maj1@k | 33 | No | LLaMA: Open and Efficient Foundation Language Mo... | 2023-02-27 | Code |
| 57 | LLaMA 33B | 33 | No | LLaMA: Open and Efficient Foundation Language Mo... | 2023-02-27 | Code |
| 58 | UL2 20B (chain-of-thought) | 20 | No | UL2: Unifying Language Learning Paradigms | 2022-05-10 | Code |
| 59 | UL2 20B (0-shot) | 20 | No | UL2: Unifying Language Learning Paradigms | 2022-05-10 | Code |
| 60 | Llama SFT (Metamath ToRA Ensemble) | 13 | Yes | - | - | - |
| 61 | OpenMath-CodeLlama-13B (w/ code, SC, k=50) | 13 | Yes | OpenMathInstruct-1: A 1.8 Million Math Instructi... | 2024-02-15 | Code |
| 62 | OpenMath-CodeLlama-13B (w/ code) | 13 | Yes | OpenMathInstruct-1: A 1.8 Million Math Instructi... | 2024-02-15 | Code |
| 63 | ToRA-Code 13B | 13 | Yes | ToRA: A Tool-Integrated Reasoning Agent for Math... | 2023-09-29 | Code |
| 64 | MuggleMATH 13B | 13 | Yes | MuggleMath: Assessing the Impact of Query and Re... | 2023-10-09 | Code |
| 65 | KwaiYiiMath 13B | 13 | Yes | KwaiYiiMath: Technical Report | 2023-10-11 | - |
| 66 | MathCoder-L-13B | 13 | Yes | MathCoder: Seamless Code Integration in LLMs for... | 2023-10-05 | Code |
| 67 | MetaMath 13B | 13 | Yes | MetaMath: Bootstrap Your Own Mathematical Questi... | 2023-09-21 | Code |
| 68 | WizardMath-13B-V1.0 | 13 | Yes | WizardMath: Empowering Mathematical Reasoning fo... | 2023-08-18 | Code |
| 69 | Orca 2 13B | 13 | No | Orca 2: Teaching Small Language Models How to Re... | 2023-11-18 | - |
| 70 | RFT 13B | 13 | Yes | Scaling Relationship on Learning Mathematical Re... | 2023-08-03 | Code |
| 71 | Llama-2 13B (on 100 first questions, 4-shot, auto-optimized prompting) | 13 | No | The Unreasonable Effectiveness of Eccentric Auto... | 2024-02-09 | - |
| 72 | Vicuna (SYRELM) | 13 | Yes | Frugal LMs Trained to Invoke Symbolic Solvers Ac... | 2023-12-09 | Code |
| 73 | LLaMA 13B-maj1@k | 13 | No | LLaMA: Open and Efficient Foundation Language Mo... | 2023-02-27 | Code |
| 74 | LLaMA 13B | 13 | No | LLaMA: Open and Efficient Foundation Language Mo... | 2023-02-27 | Code |
| 75 | GPT-J (CoRe) | 12 | No | Solving Math Word Problems via Cooperative Reaso... | 2022-10-28 | Code |
| 76 | DART-Math-Llama3-8B-Uniform (0-shot CoT, w/o code) | 8 | Yes | DART-Math: Difficulty-Aware Rejection Tuning for... | 2024-06-18 | Code |
| 77 | DART-Math-Llama3-8B-Prop2Diff (0-shot CoT, w/o code) | 8 | Yes | DART-Math: Difficulty-Aware Rejection Tuning for... | 2024-06-18 | Code |
| 78 | Minerva 8B (maj5@100) | 8 | No | Solving Quantitative Reasoning Problems with Lan... | 2022-06-29 | Code |
| 79 | Minerva 8B-maj1@k (8-shot) | 8 | Yes | Solving Quantitative Reasoning Problems with Lan... | 2022-06-29 | Code |
| 80 | Minerva 8B (8-shot) | 8 | Yes | Solving Quantitative Reasoning Problems with Lan... | 2022-06-29 | Code |
| 81 | PaLM 8B (8-shot) | 8 | Yes | Solving Quantitative Reasoning Problems with Lan... | 2022-06-29 | Code |
| 82 | SFT-Mistral-7B (Metamath, OVM, Smart Ensemble) | 7 | Yes | - | - | - |
| 83 | DAMOMath-7B(MetaMath, OVM, BS, Ensemble) | 7 | Yes | - | - | - |
| 84 | SFT-Mistral-7B (Metamath + ovm +ensemble) | 7 | Yes | - | - | - |
| 85 | DAMOMath-7B(MetaMath, OVM, Ensemble) | 7 | Yes | - | - | - |
| 86 | Shepherd+Mistral-7B (SFT on MetaMATH + PRM RL+ PRM rerank, k=256) | 7 | Yes | Math-Shepherd: Verify and Reinforce LLMs Step-by... | 2023-12-14 | Code |
| 87 | DeepSeekMATH-RL-7B | 7 | Yes | DeepSeekMath: Pushing the Limits of Mathematical... | 2024-02-05 | Code |
| 88 | DART-Math-DSMath-7B-Uniform (0-shot CoT, w/o code) | 7 | Yes | DART-Math: Difficulty-Aware Rejection Tuning for... | 2024-06-18 | Code |
| 89 | MMOS-DeepSeekMath-7B(0-shot,k=50) | 7 | Yes | An Empirical Study of Data Ability Boundary in L... | 2024-02-23 | Code |
| 90 | OpenMath-Mistral-7B (w/ code, SC, k=50) | 7 | Yes | OpenMathInstruct-1: A 1.8 Million Math Instructi... | 2024-02-15 | Code |
| 91 | Orca-Math 7B (fine-tuned) | 7 | Yes | Orca-Math: Unlocking the potential of SLMs in Gr... | 2024-02-16 | - |
| 92 | DART-Math-DSMath-7B-Prop2Diff (0-shot CoT, w/o code) | 7 | Yes | DART-Math: Difficulty-Aware Rejection Tuning for... | 2024-06-18 | Code |
| 93 | OpenMath-CodeLlama-7B (w/ code, SC, k=50) | 7 | Yes | OpenMathInstruct-1: A 1.8 Million Math Instructi... | 2024-02-15 | Code |
| 94 | OVM-Mistral-7B (verify100@1) | 7 | No | OVM, Outcome-supervised Value Models for Plannin... | 2023-11-16 | Code |
| 95 | Shepherd + Mistral-7B (SFT on MetaMATH + PRM RL) | 7 | Yes | Math-Shepherd: Verify and Reinforce LLMs Step-by... | 2023-12-14 | Code |
| 96 | WizardMath-7B-V1.1 | 7 | Yes | WizardMath: Empowering Mathematical Reasoning fo... | 2023-08-18 | Code |
| 97 | OVM-Mistral-7B (verify20@1) | 7 | No | OVM, Outcome-supervised Value Models for Plannin... | 2023-11-16 | Code |
| 98 | DART-Math-Mistral-7B-Uniform (0-shot CoT, w/o code) | 7 | Yes | DART-Math: Difficulty-Aware Rejection Tuning for... | 2024-06-18 | Code |
| 99 | DART-Math-Mistral-7B-Prop2Diff (0-shot CoT, w/o code) | 7 | Yes | DART-Math: Difficulty-Aware Rejection Tuning for... | 2024-06-18 | Code |
| 100 | MMOS-DeepSeekMath-7B(0-shot) | 7 | Yes | An Empirical Study of Data Ability Boundary in L... | 2024-02-23 | Code |
| 101 | OpenMath-Mistral-7B (w/ code) | 7 | Yes | OpenMathInstruct-1: A 1.8 Million Math Instructi... | 2024-02-15 | Code |
| 102 | MetaMath-Mistral-7B | 7 | Yes | MetaMath: Bootstrap Your Own Mathematical Questi... | 2023-09-21 | Code |
| 103 | OpenChat-3.5 7B | 7 | No | OpenChat: Advancing Open-source Language Models ... | 2023-09-20 | Code |
| 104 | Arithmo2-Mistral-7B | 7 | No | - | - | - |
| 105 | OpenMath-CodeLlama-7B (w/ code) | 7 | Yes | OpenMathInstruct-1: A 1.8 Million Math Instructi... | 2024-02-15 | Code |
| 106 | Arithmo-Mistral-7B | 7 | No | - | - | - |
| 107 | MathCoder-CL-13B | 7 | Yes | MathCoder: Seamless Code Integration in LLMs for... | 2023-10-05 | Code |
| 108 | MMOS-CODE-7B(0-shot) | 7 | Yes | An Empirical Study of Data Ability Boundary in L... | 2024-02-23 | Code |
| 109 | OVM-Llama2-7B (verify100@1) | 7 | No | OVM, Outcome-supervised Value Models for Plannin... | 2023-11-16 | Code |
| 110 | ToRA-Code 7B | 7 | Yes | ToRA: A Tool-Integrated Reasoning Agent for Math... | 2023-09-29 | Code |
| 111 | MuggleMATH 7B | 7 | Yes | MuggleMath: Assessing the Impact of Query and Re... | 2023-10-09 | Code |
| 112 | MathCoder-CL-7B | 7 | Yes | MathCoder: Seamless Code Integration in LLMs for... | 2023-10-05 | Code |
| 113 | MetaMath 7B | 7 | Yes | MetaMath: Bootstrap Your Own Mathematical Questi... | 2023-09-21 | Code |
| 114 | MathCoder-L-7B | 7 | Yes | MathCoder: Seamless Code Integration in LLMs for... | 2023-10-05 | Code |
| 115 | WizardMath-7B-V1.0 | 7 | Yes | WizardMath: Empowering Mathematical Reasoning fo... | 2023-08-18 | Code |
| 116 | Mistral 7B (maj@8) | 7 | No | Mistral 7B | 2023-10-10 | Code |
| 117 | RFT 7B | 7 | Yes | Scaling Relationship on Learning Mathematical Re... | 2023-08-03 | Code |
| 118 | Orca 2 7B | 7 | No | Orca 2: Teaching Small Language Models How to Re... | 2023-11-18 | - |
| 119 | Mistral 7B (on 100 first questions, 4-shot, auto-optimized prompting) | 7 | No | The Unreasonable Effectiveness of Eccentric Auto... | 2024-02-09 | - |
| 120 | Llemma 7B | 7 | No | Llemma: An Open Language Model For Mathematics | 2023-10-16 | Code |
| 121 | LLaMA 7B (maj1@k) | 7 | No | LLaMA: Open and Efficient Foundation Language Mo... | 2023-02-27 | Code |
| 122 | LLaMA 7B | 7 | No | LLaMA: Open and Efficient Foundation Language Mo... | 2023-02-27 | Code |
| 123 | Shivaay-4B (8-shot chain-of-thought) | 4 | No | - | - | - |
| 124 | Phi-GSM 2.7B (fine-tuned) | 2.7 | No | TinyGSM: achieving >80% on GSM8k with small lang... | 2023-12-14 | - |
| 125 | GPT-Neo-2.7B + Self-Sampling | 2.7 | No | Learning Math Reasoning from Self-Sampled Correc... | 2022-05-28 | Code |
| 126 | Phi-GSM+V 1.3B+1.3B (verify48@1) | 2.6 | No | TinyGSM: achieving >80% on GSM8k with small lang... | 2023-12-14 | - |
| 127 | CodeT5+ | 0.77 | No | CodeT5+: Open Code Large Language Models for Cod... | 2023-05-13 | Code |
| 128 | GPT-2-Medium 355M + question-solution classifier (BS=5) | 0.355 | No | Composing Ensembles of Pre-trained Models via It... | 2022-10-20 | - |
| 129 | GPT-2-Medium 355M (fine-tuned, BS=5) | 0.355 | No | Composing Ensembles of Pre-trained Models via It... | 2022-10-20 | - |
| 130 | GPT-2-Medium 355M + question-solution classifier (BS=1) | 0.355 | No | Composing Ensembles of Pre-trained Models via It... | 2022-10-20 | - |
| 131 | GPT-2-Medium 355M (BS=5) | 0.355 | No | Composing Ensembles of Pre-trained Models via It... | 2022-10-20 | - |
| 132 | GPT-Neo 125M + Self-Sampling | 0.125 | No | Learning Math Reasoning from Self-Sampled Correc... | 2022-05-28 | Code |