Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Math Word Problem Solving on MATH

Metric: Accuracy (higher is better)
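
As a rough illustration of the metric, the sketch below computes accuracy as the percentage of problems whose predicted final answer matches the reference answer. The `normalize` helper and its rules are hypothetical simplifications introduced here; the leaderboard does not specify how each submission normalized or compared answers.

```python
def normalize(answer: str) -> str:
    """Hypothetical normalization: trim whitespace, drop a trailing period, remove spaces."""
    return answer.strip().rstrip(".").replace(" ", "")


def accuracy(predictions: list[str], references: list[str]) -> float:
    """Return the percentage of predictions that match their reference answer."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one problem is required")
    # Count exact matches after the (assumed) normalization step.
    correct = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)


if __name__ == "__main__":
    preds = ["1/2", " 3 ", "x=2"]
    refs = ["1/2", "3", "x=3"]
    print(f"{accuracy(preds, refs):.1f}")
```

Real evaluation harnesses typically apply more careful answer normalization (for example, for equivalent LaTeX expressions), so scores from this sketch would not be comparable to the figures in the table.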

Results

# | Model | Accuracy (%) | Extra Data | Paper | Date | Code
1 | Gemini 2.0 Flash Experimental | 89.7 | No | - | - | -
2 | Qwen2.5-Math-72B-Instruct(TIR,Greedy) | 88.1 | Yes | Qwen2.5-Math Technical Report: Toward Mathematic... | 2024-09-18 | -
3 | GPT-4 Turbo (MACM, w/code, voting) | 87.92 | No | MACM: Utilizing a Multi-Agent System for Conditi... | 2024-04-06 | Code
4 | Qwen2.5-Math-72B-Instruct(COT,Greedy) | 85.9 | Yes | Qwen2.5-Math Technical Report: Toward Mathematic... | 2024-09-18 | -
5 | Qwen2.5-Math-7B-Instruct(TIR,Greedy) | 85.2 | Yes | Qwen2.5-Math Technical Report: Toward Mathematic... | 2024-09-18 | -
6 | GPT-4-code model (CSV, w/ code, SC, k=16) | 84.3 | No | Solving Challenging Math Word Problems Using GPT... | 2023-08-15 | Code
7 | Qwen2-Math-72B-Instruct(greedy) | 84 | Yes | Qwen2 Technical Report | 2024-07-15 | Code
8 | Qwen2.5-Math-7B-Instruct(COT,Greedy) | 83.6 | Yes | Qwen2.5-Math Technical Report: Toward Mathematic... | 2024-09-18 | -
9 | Qwen2.5-Math-1.5B-Instruct(TIR,Greedy) | 79.9 | Yes | Qwen2.5-Math Technical Report: Toward Mathematic... | 2024-09-18 | -
10 | OpenMath2-Llama3.1-70B (majority@256) | 79.6 | Yes | OpenMathInstruct-2: Accelerating AI for Math wit... | 2024-10-02 | Code
11 | OpenMath2-Llama3.1-8B (majority@256) | 76.1 | Yes | OpenMathInstruct-2: Accelerating AI for Math wit... | 2024-10-02 | Code
12 | Qwen2.5-Math-1.5B-Instruct(COT,Greedy) | 75.8 | Yes | Qwen2.5-Math Technical Report: Toward Mathematic... | 2024-09-18 | -
13 | GPT-4-code model (CSV, w/ code) | 73.5 | No | Solving Challenging Math Word Problems Using GPT... | 2023-08-15 | Code
14 | CR (GPT-4-turbo model, w/ code) | 72.2 | No | Cumulative Reasoning with Large Language Models | 2023-08-08 | Code
15 | OpenMath2-Llama3.1-70B | 71.9 | Yes | OpenMathInstruct-2: Accelerating AI for Math wit... | 2024-10-02 | Code
16 | LogicNet (with code interpreter) | 71.2 | Yes | Solving Challenging Math Word Problems Using GPT... | 2023-08-15 | Code
17 | Qwen2-72B-Instruct-Step-DPO (0-shot CoT, w/o code) | 70.8 | Yes | Step-DPO: Step-wise Preference Optimization for ... | 2024-06-26 | Code
18 | GPT-4-code model (w/ code) | 69.7 | No | Solving Challenging Math Word Problems Using GPT... | 2023-08-15 | Code
19 | OpenMath2-Llama3.1-8B | 67.8 | Yes | OpenMathInstruct-2: Accelerating AI for Math wit... | 2024-10-02 | Code
20 | AlphaMath-7B-SBS@3 | 66.3 | No | AlphaMath Almost Zero: Process Supervision witho... | 2024-05-06 | Code
21 | Minerva 62B (maj5@256) | 64.9 | No | Solving Quantitative Reasoning Problems with Lan... | 2022-06-29 | Code
22 | DAMOMath-7B | 64.5 | Yes | - | - | -
23 | MMOS-DeepSeekMath-7B(0-shot,k=50) | 63.7 | Yes | An Empirical Study of Data Ability Boundary in L... | 2024-02-23 | Code
24 | GPT-4-code model (w/o code) | 60.8 | No | Solving Challenging Math Word Problems Using GPT... | 2023-08-15 | Code
25 | OpenMath-CodeLlama-70B (w/ code, SC, k=50) | 60.4 | Yes | OpenMathInstruct-1: A 1.8 Million Math Instructi... | 2024-02-15 | Code
26 | OpenMath-CodeLlama-34B (w/ code, SC, k=50) | 60.2 | Yes | OpenMathInstruct-1: A 1.8 Million Math Instructi... | 2024-02-15 | Code
27 | ToRA-Code 34B model (w/ code, SC, k=50) | 60 | Yes | ToRA: A Tool-Integrated Reasoning Agent for Math... | 2023-09-29 | Code
28 | DeepSeekMATH-RL-7B (w/ code, greedy decoding) | 58.8 | Yes | DeepSeekMath: Pushing the Limits of Mathematical... | 2024-02-05 | Code
29 | OpenMath-Llama2-70B (w/ code, SC, k=50) | 58.3 | Yes | OpenMathInstruct-1: A 1.8 Million Math Instructi... | 2024-02-15 | Code
30 | CR (GPT-4 model, w/o code) | 58 | No | Cumulative Reasoning with Large Language Models | 2023-08-08 | Code
31 | OpenMath-CodeLlama-13B (w/ code, SC, k=50) | 57.6 | Yes | OpenMathInstruct-1: A 1.8 Million Math Instructi... | 2024-02-15 | Code
32 | OpenMath-Mistral-7B (w/ code, SC, k=50) | 57.2 | Yes | OpenMathInstruct-1: A 1.8 Million Math Instructi... | 2024-02-15 | Code
33 | ToRA 70B (w/ code, SC, k=50) | 56.9 | Yes | ToRA: A Tool-Integrated Reasoning Agent for Math... | 2023-09-29 | Code
34 | SKiC (GPT-4 model) | 56.4 | No | Skills-in-Context Prompting: Unlocking Compositi... | 2023-08-01 | -
35 | DART-Math-Llama3-70B-Prop2Diff (0-shot CoT, w/o code) | 56.1 | Yes | DART-Math: Difficulty-Aware Rejection Tuning for... | 2024-06-18 | Code
36 | OpenMath-CodeLlama-7B (w/ code, SC, k=50) | 55.6 | Yes | OpenMathInstruct-1: A 1.8 Million Math Instructi... | 2024-02-15 | Code
37 | MMOS-DeepSeekMath-7B(0-shot) | 55 | Yes | An Empirical Study of Data Ability Boundary in L... | 2024-02-23 | Code
38 | DART-Math-Llama3-70B-Uniform (0-shot CoT, w/o code) | 54.9 | Yes | DART-Math: Difficulty-Aware Rejection Tuning for... | 2024-06-18 | Code
39 | PHP (GPT-4 model) | 53.9 | No | Progressive-Hint Prompting Improves Reasoning in... | 2023-04-19 | Code
40 | DART-Math-DSMath-7B-Prop2Diff (0-shot CoT, w/o code) | 53.6 | Yes | DART-Math: Difficulty-Aware Rejection Tuning for... | 2024-06-18 | Code
41 | Gemini Ultra (4-shot) | 53.2 | No | Gemini: A Family of Highly Capable Multimodal Mo... | 2023-12-19 | Code
42 | DART-Math-DSMath-7B-Uniform (0-shot CoT, w/o code) | 52.9 | Yes | DART-Math: Difficulty-Aware Rejection Tuning for... | 2024-06-18 | Code
43 | GPT-4 model (w/ code, PAL) | 51.8 | No | PAL: Program-aided Language Models | 2022-11-18 | Code
44 | DeepSeekMATH-RL-7B (greedy decoding) | 51.7 | Yes | DeepSeekMath: Pushing the Limits of Mathematical... | 2024-02-05 | Code
45 | AlphaLLM (with MCTS) | 51 | No | Toward Self-Improvement of LLMs via Imagination,... | 2024-04-18 | Code
46 | ToRA-Code 34B (w/ code) | 50.8 | Yes | ToRA: A Tool-Integrated Reasoning Agent for Math... | 2023-09-29 | Code
47 | OpenMath-CodeLlama-70B (w/ code) | 50.7 | Yes | OpenMathInstruct-1: A 1.8 Million Math Instructi... | 2024-02-15 | Code
48 | Minerva 540B (maj1@k, k=64) | 50.3 | No | Solving Quantitative Reasoning Problems with Lan... | 2022-06-29 | Code
49 | ToRA 70B (w/ code) | 49.7 | Yes | ToRA: A Tool-Integrated Reasoning Agent for Math... | 2023-09-29 | Code
50 | MMOS-CODE-34B(0-shot) | 49.5 | Yes | An Empirical Study of Data Ability Boundary in L... | 2024-02-23 | Code
51 | DeepSeekMath-7B-KPMath-Plus | 48.8 | No | Key-Point-Driven Data Synthesis with its Enhance... | 2024-03-04 | -
52 | PaLM 2 (few-shot, k=4, SC) | 48.8 | No | PaLM 2 Technical Report | 2023-05-17 | Code
53 | Llemma-34B-KPMath-Plus | 48.6 | No | Key-Point-Driven Data Synthesis with its Enhance... | 2024-03-04 | -
54 | OpenMath-CodeLlama-34B (w/ code) | 48.3 | Yes | OpenMathInstruct-1: A 1.8 Million Math Instructi... | 2024-02-15 | Code
55 | Shepherd + DeepSeek-67B (SFT on MetaMATH + PRM rerank, k=256) | 48.1 | Yes | Math-Shepherd: Verify and Reinforce LLMs Step-by... | 2023-12-14 | Code
56 | ToRA-Code 13B (w/ code) | 48.1 | Yes | ToRA: A Tool-Integrated Reasoning Agent for Math... | 2023-09-29 | Code
57 | Minerva 8B (maj5@256) | 47.6 | No | Solving Quantitative Reasoning Problems with Lan... | 2022-06-29 | Code
58 | Mistral-7B-KPMath-Plus | 46.8 | Yes | Key-Point-Driven Data Synthesis with its Enhance... | 2024-03-04 | -
59 | DART-Math-Llama3-8B-Prop2Diff (0-shot CoT, w/o code) | 46.6 | Yes | DART-Math: Difficulty-Aware Rejection Tuning for... | 2024-06-18 | Code
60 | OpenMath-Llama2-70B (w/ code) | 46.3 | Yes | OpenMathInstruct-1: A 1.8 Million Math Instructi... | 2024-02-15 | Code
61 | OpenMath-CodeLlama-13B (w/ code) | 45.5 | Yes | OpenMathInstruct-1: A 1.8 Million Math Instructi... | 2024-02-15 | Code
62 | DART-Math-Mistral-7B-Prop2Diff (0-shot CoT, w/o code) | 45.5 | Yes | DART-Math: Difficulty-Aware Rejection Tuning for... | 2024-06-18 | Code
63 | DART-Math-Llama3-8B-Uniform (0-shot CoT, w/o code) | 45.3 | Yes | DART-Math: Difficulty-Aware Rejection Tuning for... | 2024-06-18 | Code
64 | MathCoder-CL-34B | 45.2 | Yes | MathCoder: Seamless Code Integration in LLMs for... | 2023-10-05 | Code
65 | MathCoder-L-34B | 45.1 | Yes | MathCoder: Seamless Code Integration in LLMs for... | 2023-10-05 | Code
66 | MMIQC-72B | 45 | Yes | Augmenting Math Word Problems via Iterative Ques... | 2024-01-17 | Code
67 | ToRA-Code 7B (w/ code) | 44.6 | Yes | ToRA: A Tool-Integrated Reasoning Agent for Math... | 2023-09-29 | Code
68 | OpenMath-Mistral-7B (w/ code) | 44.5 | Yes | OpenMathInstruct-1: A 1.8 Million Math Instructi... | 2024-02-15 | Code
69 | MMOS-CODE-7B(0-shot) | 44.3 | Yes | An Empirical Study of Data Ability Boundary in L... | 2024-02-23 | Code
70 | OpenMath-CodeLlama-7B (w/ code) | 43.6 | Yes | OpenMathInstruct-1: A 1.8 Million Math Instructi... | 2024-02-15 | Code
71 | Shepherd+Mistral-7B (SFT on MetaMATH + PRM RL+ PRM rerank, k=256) | 43.5 | Yes | Math-Shepherd: Verify and Reinforce LLMs Step-by... | 2023-12-14 | Code
72 | DART-Math-Mistral-7B-Uniform (0-shot CoT, w/o code) | 43.5 | Yes | DART-Math: Difficulty-Aware Rejection Tuning for... | 2024-06-18 | Code
73 | Minerva 62B (maj1@k, k=64) | 43.4 | No | Solving Quantitative Reasoning Problems with Lan... | 2022-06-29 | Code
74 | ToRA 13B (w/ code) | 43 | Yes | ToRA: A Tool-Integrated Reasoning Agent for Math... | 2023-09-29 | Code
75 | GPT-4 | 42.5 | No | Sparks of Artificial General Intelligence: Early... | 2023-03-22 | Code
76 | SFT-Mistral-7B | 41.8 | Yes | - | - | -
77 | Llama2-13B-KPMath-Plus | 41 | No | Key-Point-Driven Data Synthesis with its Enhance... | 2024-03-04 | -
78 | ToRA 7B (w/ code) | 40.1 | Yes | ToRA: A Tool-Integrated Reasoning Agent for Math... | 2023-09-29 | Code
79 | MathCoder-CL-13B | 35.9 | Yes | MathCoder: Seamless Code Integration in LLMs for... | 2023-10-05 | Code
80 | MuggleMATH-70B | 35.6 | Yes | MuggleMath: Assessing the Impact of Query and Re... | 2023-10-09 | Code
81 | PaLM 2 (few-shot, k=4, CoT) | 34.3 | No | PaLM 2 Technical Report | 2023-05-17 | Code
82 | Minerva 540B | 33.6 | No | Solving Quantitative Reasoning Problems with Lan... | 2022-06-29 | Code
83 | Minerva 540B (5-shot) mCoT | 33.6 | No | Galactica: A Large Language Model for Science | 2022-11-16 | Code
84 | Shepherd + Mistral-7B (SFT on MetaMATH + PRM RL) | 33 | Yes | Math-Shepherd: Verify and Reinforce LLMs Step-by... | 2023-12-14 | Code
85 | WizardMath-7B-V1.1 | 33 | Yes | WizardMath: Empowering Mathematical Reasoning fo... | 2023-08-18 | Code
86 | Gemini Pro (4-shot) | 32.6 | No | Gemini: A Family of Highly Capable Multimodal Mo... | 2023-12-19 | Code
87 | MuggleMATH-13B | 30.7 | Yes | MuggleMath: Assessing the Impact of Query and Re... | 2023-10-09 | Code
88 | MathCoder-CL-7B | 30.2 | Yes | MathCoder: Seamless Code Integration in LLMs for... | 2023-10-05 | Code
89 | MathCoder-L-13B | 29.9 | Yes | MathCoder: Seamless Code Integration in LLMs for... | 2023-10-05 | Code
90 | Qwen2idae-16x14B (4-shot) | 29.9 | No | Parameter-Efficient Sparsity Crafting from Dense... | 2024-01-05 | Code
91 | OpenChat-3.5-1210 7B | 28.9 | No | OpenChat: Advancing Open-source Language Models ... | 2023-09-20 | Code
92 | OpenChat-3.5 7B | 28.6 | No | OpenChat: Advancing Open-source Language Models ... | 2023-09-20 | Code
93 | Mixtral 8x7B (maj@4) | 28.4 | No | Mixtral of Experts | 2024-01-08 | Code
94 | Minerva 62B (4-shot) | 27.6 | No | Solving Quantitative Reasoning Problems with Lan... | 2022-06-29 | Code
95 | MetaMath 70B | 26 | Yes | MetaMath: Bootstrap Your Own Mathematical Questi... | 2023-09-21 | Code
96 | MuggleMATH 7B | 25.8 | Yes | MuggleMath: Assessing the Impact of Query and Re... | 2023-10-09 | Code
97 | Minerva 8B (maj1@k, k=64) | 25.4 | No | Solving Quantitative Reasoning Problems with Lan... | 2022-06-29 | Code
98 | MathCoder-L-7B | 23.3 | Yes | MathCoder: Seamless Code Integration in LLMs for... | 2023-10-05 | Code
99 | WizardMath-70B-V1.0 | 22.7 | Yes | WizardMath: Empowering Mathematical Reasoning fo... | 2023-08-18 | Code
100 | Camelidae-8×34B (4-shot) | 22.6 | No | Parameter-Efficient Sparsity Crafting from Dense... | 2024-01-05 | Code
101 | MetaMath 13B | 22.5 | Yes | MetaMath: Bootstrap Your Own Mathematical Questi... | 2023-09-21 | Code
102 | LLaMA 65B (maj1@k) | 20.5 | No | LLaMA: Open and Efficient Foundation Language Mo... | 2023-02-27 | Code
103 | GAL 120B (5-shot) mCoT | 20.4 | No | Galactica: A Large Language Model for Science | 2022-11-16 | Code
104 | MetaMath 7B | 19.4 | Yes | MetaMath: Bootstrap Your Own Mathematical Questi... | 2023-09-21 | Code
105 | davinci-002 175B | 19.1 | No | Solving Quantitative Reasoning Problems with Lan... | 2022-06-29 | Code
106 | Branch-Train-MiX 4x7B (sampling top-2 experts) | 17.8 | No | Branch-Train-MiX: Mixing Expert LLMs into a Mixt... | 2024-03-12 | Code
107 | GAL 120B <work> | 16.6 | No | Galactica: A Large Language Model for Science | 2022-11-16 | Code
108 | LLaMA 33B-maj1@k | 15.2 | No | LLaMA: Open and Efficient Foundation Language Mo... | 2023-02-27 | Code
109 | Minerva 8B | 14.1 | No | Solving Quantitative Reasoning Problems with Lan... | 2022-06-29 | Code
110 | WizardMath-13B-V1.0 | 14 | Yes | WizardMath: Empowering Mathematical Reasoning fo... | 2023-08-18 | Code
111 | Mistral 7B (maj@4) | 13.1 | No | Mistral 7B | 2023-10-10 | Code
112 | GAL 30B (5-shot) mCoT | 12.7 | No | Galactica: A Large Language Model for Science | 2022-11-16 | Code
113 | Mistral 7B (maj@4) | 12.7 | No | Mixtral of Experts | 2024-01-08 | Code
114 | GAL 30B <work> | 11.4 | No | Galactica: A Large Language Model for Science | 2022-11-16 | Code
115 | WizardMath-7B-V1.0 | 10.7 | Yes | WizardMath: Empowering Mathematical Reasoning fo... | 2023-08-18 | Code
116 | LLaMA 65B | 10.6 | No | LLaMA: Open and Efficient Foundation Language Mo... | 2023-02-27 | Code
117 | PaLM 540B | 8.8 | No | Solving Quantitative Reasoning Problems with Lan... | 2022-06-29 | Code
118 | PaLM 540B (5-shot) mCoT | 8.8 | No | Galactica: A Large Language Model for Science | 2022-11-16 | Code
119 | LLaMA 13B-maj1@k | 8.8 | No | LLaMA: Open and Efficient Foundation Language Mo... | 2023-02-27 | Code
120 | LLaMA 33B | 7.1 | No | LLaMA: Open and Efficient Foundation Language Mo... | 2023-02-27 | Code
121 | LLaMA 7B-maj1@k | 6.9 | No | LLaMA: Open and Efficient Foundation Language Mo... | 2023-02-27 | Code
122 | GPT-2 (1.5B) | 6.9 | No | Measuring Mathematical Problem Solving With the ... | 2021-03-05 | Code
123 | GPT-2 (0.7B) | 6.4 | No | Measuring Mathematical Problem Solving With the ... | 2021-03-05 | Code
124 | GPT-2 (0.3B) | 6.2 | No | Measuring Mathematical Problem Solving With the ... | 2021-03-05 | Code
125 | GPT-3 13B | 5.6 | No | Measuring Mathematical Problem Solving With the ... | 2021-03-05 | Code
126 | PaLM 8B (fine-tuned) | 5.6 | No | Solving Quantitative Reasoning Problems with Lan... | 2022-06-29 | Code
127 | GPT-2 (0.1B) | 5.4 | No | Measuring Mathematical Problem Solving With the ... | 2021-03-05 | Code
128 | GPT-3-175B (few-shot) | 5.2 | No | Measuring Mathematical Problem Solving With the ... | 2021-03-05 | Code
129 | GPT-3 175B (8-shot) | 5.2 | No | Galactica: A Large Language Model for Science | 2022-11-16 | Code
130 | PaLM 62B | 4.4 | No | Solving Quantitative Reasoning Problems with Lan... | 2022-06-29 | Code
131 | LLaMA 13B | 3.9 | No | LLaMA: Open and Efficient Foundation Language Mo... | 2023-02-27 | Code
132 | GPT-3-13B (few-shot) | 3 | No | Measuring Mathematical Problem Solving With the ... | 2021-03-05 | Code
133 | LLaMA 7B | 2.9 | No | LLaMA: Open and Efficient Foundation Language Mo... | 2023-02-27 | Code
134 | GPT-3 2.7B | 2.9 | No | Measuring Mathematical Problem Solving With the ... | 2021-03-05 | Code
135 | PaLM 8B | 1.5 | No | Solving Quantitative Reasoning Problems with Lan... | 2022-06-29 | Code
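
Several entries carry tags such as maj1@k, majority@256, or SC with k=50. These conventionally mean that k answers are sampled per problem and the most frequent final answer is the one scored. A minimal sketch of that voting step follows, assuming the final answers have already been extracted as strings; the function name and the tie-breaking rule are illustrative choices, not taken from any of the listed papers.

```python
from collections import Counter


def majority_vote(samples: list[str]) -> str:
    """Return the most frequent answer among k sampled answers.

    Ties are broken in favor of the answer that appeared first,
    which is an arbitrary choice made for this sketch.
    """
    if not samples:
        raise ValueError("at least one sampled answer is required")
    # Strip surrounding whitespace so trivially different strings are pooled.
    counts = Counter(s.strip() for s in samples)
    return counts.most_common(1)[0][0]


if __name__ == "__main__":
    # Five hypothetical samples for one problem.
    print(majority_vote(["4", "5", "4", " 4", "7"]))
```

The voted answer is then compared with the reference in the same way as a single greedy answer, which is why voted and greedy results for the same model appear as separate rows.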