Math Word Problem Solving on MATH

Metric: Accuracy (higher is better)
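
As a rough illustration of the metric, accuracy here can be read as the percentage of test problems whose final answer is judged correct. The sketch below is not the leaderboard's or any paper's evaluation code: the function name `accuracy` is hypothetical, and it assumes an exact string match after trimming whitespace, whereas real MATH evaluations typically normalize LaTeX answers before comparing.

```python
def accuracy(predictions, references):
    """Return the percentage of predictions that match their reference answer."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot compute accuracy on an empty set")
    # Assumption: exact string match after trimming surrounding whitespace.
    correct = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)


# Hypothetical final answers for three problems; two of three match.
preds = ["4", "\\frac{1}{2}", "7"]
refs = ["4", "\\frac{1}{2}", "8"]
print(round(accuracy(preds, refs), 1))  # prints 66.7
```

A score such as 89.7 in the table below would then correspond to 89.7% of problems answered correctly under whatever matching rule each submission used.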

Results

#	Model	Accuracy	Extra Data	Paper	Date	Code
1	Gemini 2.0 Flash Experimental	89.7	No	-	-	-
2	Qwen2.5-Math-72B-Instruct(TIR,Greedy)	88.1	Yes	Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement	2024-09-18	-
3	GPT-4 Turbo (MACM, w/code, voting)	87.92	No	MACM: Utilizing a Multi-Agent System for Condition Mining in Solving Complex Mathematical Problems	2024-04-06	Code
4	Qwen2.5-Math-72B-Instruct(COT,Greedy)	85.9	Yes	Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement	2024-09-18	-
5	Qwen2.5-Math-7B-Instruct(TIR,Greedy)	85.2	Yes	Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement	2024-09-18	-
6	GPT-4-code model (CSV, w/ code, SC, k=16)	84.3	No	Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification	2023-08-15	Code
7	Qwen2-Math-72B-Instruct(greedy)	84	Yes	Qwen2 Technical Report	2024-07-15	Code
8	Qwen2.5-Math-7B-Instruct(COT,Greedy)	83.6	Yes	Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement	2024-09-18	-
9	Qwen2.5-Math-1.5B-Instruct(TIR,Greedy)	79.9	Yes	Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement	2024-09-18	-
10	OpenMath2-Llama3.1-70B (majority@256)	79.6	Yes	OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data	2024-10-02	Code
11	OpenMath2-Llama3.1-8B (majority@256)	76.1	Yes	OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data	2024-10-02	Code
12	Qwen2.5-Math-1.5B-Instruct(COT,Greedy)	75.8	Yes	Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement	2024-09-18	-
13	GPT-4-code model (CSV, w/ code)	73.5	No	Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification	2023-08-15	Code
14	CR (GPT-4-turbo model, w/ code)	72.2	No	Cumulative Reasoning with Large Language Models	2023-08-08	Code
15	OpenMath2-Llama3.1-70B	71.9	Yes	OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data	2024-10-02	Code
16	LogicNet (with code interpreter)	71.2	Yes	Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification	2023-08-15	Code
17	Qwen2-72B-Instruct-Step-DPO (0-shot CoT, w/o code)	70.8	Yes	Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs	2024-06-26	Code
18	GPT-4-code model (w/ code)	69.7	No	Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification	2023-08-15	Code
19	OpenMath2-Llama3.1-8B	67.8	Yes	OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data	2024-10-02	Code
20	AlphaMath-7B-SBS@3	66.3	No	AlphaMath Almost Zero: Process Supervision without Process	2024-05-06	Code
21	Minerva 62B (maj5@256)	64.9	No	Solving Quantitative Reasoning Problems with Language Models	2022-06-29	Code
22	DAMOMath-7B	64.5	Yes	-	-	-
23	MMOS-DeepSeekMath-7B(0-shot,k=50)	63.7	Yes	An Empirical Study of Data Ability Boundary in LLMs' Math Reasoning	2024-02-23	Code
24	GPT-4-code model (w/o code)	60.8	No	Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification	2023-08-15	Code
25	OpenMath-CodeLlama-70B (w/ code, SC, k=50)	60.4	Yes	OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset	2024-02-15	Code
26	OpenMath-CodeLlama-34B (w/ code, SC, k=50)	60.2	Yes	OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset	2024-02-15	Code
27	ToRA-Code 34B model (w/ code, SC, k=50)	60	Yes	ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving	2023-09-29	Code
28	DeepSeekMATH-RL-7B (w/ code, greedy decoding)	58.8	Yes	DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models	2024-02-05	Code
29	OpenMath-Llama2-70B (w/ code, SC, k=50)	58.3	Yes	OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset	2024-02-15	Code
30	CR (GPT-4 model, w/o code)	58	No	Cumulative Reasoning with Large Language Models	2023-08-08	Code
31	OpenMath-CodeLlama-13B (w/ code, SC, k=50)	57.6	Yes	OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset	2024-02-15	Code
32	OpenMath-Mistral-7B (w/ code, SC, k=50)	57.2	Yes	OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset	2024-02-15	Code
33	ToRA 70B (w/ code, SC, k=50)	56.9	Yes	ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving	2023-09-29	Code
34	SKiC (GPT-4 model)	56.4	No	Skills-in-Context Prompting: Unlocking Compositionality in Large Language Models	2023-08-01	-
35	DART-Math-Llama3-70B-Prop2Diff (0-shot CoT, w/o code)	56.1	Yes	DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving	2024-06-18	Code
36	OpenMath-CodeLlama-7B (w/ code, SC, k=50)	55.6	Yes	OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset	2024-02-15	Code
37	MMOS-DeepSeekMath-7B(0-shot)	55	Yes	An Empirical Study of Data Ability Boundary in LLMs' Math Reasoning	2024-02-23	Code
38	DART-Math-Llama3-70B-Uniform (0-shot CoT, w/o code)	54.9	Yes	DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving	2024-06-18	Code
39	PHP (GPT-4 model)	53.9	No	Progressive-Hint Prompting Improves Reasoning in Large Language Models	2023-04-19	Code
40	DART-Math-DSMath-7B-Prop2Diff (0-shot CoT, w/o code)	53.6	Yes	DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving	2024-06-18	Code
41	Gemini Ultra (4-shot)	53.2	No	Gemini: A Family of Highly Capable Multimodal Models	2023-12-19	Code
42	DART-Math-DSMath-7B-Uniform (0-shot CoT, w/o code)	52.9	Yes	DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving	2024-06-18	Code
43	GPT-4 model (w/ code, PAL)	51.8	No	PAL: Program-aided Language Models	2022-11-18	Code
44	DeepSeekMATH-RL-7B (greedy decoding)	51.7	Yes	DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models	2024-02-05	Code
45	AlphaLLM (with MCTS)	51	No	Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing	2024-04-18	Code
46	ToRA-Code 34B (w/ code)	50.8	Yes	ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving	2023-09-29	Code
47	OpenMath-CodeLlama-70B (w/ code)	50.7	Yes	OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset	2024-02-15	Code
48	Minerva 540B (maj1@k, k=64)	50.3	No	Solving Quantitative Reasoning Problems with Language Models	2022-06-29	Code
49	ToRA 70B (w/ code)	49.7	Yes	ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving	2023-09-29	Code
50	MMOS-CODE-34B(0-shot)	49.5	Yes	An Empirical Study of Data Ability Boundary in LLMs' Math Reasoning	2024-02-23	Code
51	DeepSeekMath-7B-KPMath-Plus	48.8	No	Key-Point-Driven Data Synthesis with its Enhancement on Mathematical Reasoning	2024-03-04	-
52	PaLM 2 (few-shot, k=4, SC)	48.8	No	PaLM 2 Technical Report	2023-05-17	Code
53	Llemma-34B-KPMath-Plus	48.6	No	Key-Point-Driven Data Synthesis with its Enhancement on Mathematical Reasoning	2024-03-04	-
54	OpenMath-CodeLlama-34B (w/ code)	48.3	Yes	OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset	2024-02-15	Code
55	Shepherd + DeepSeek-67B (SFT on MetaMATH + PRM rerank, k=256)	48.1	Yes	Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations	2023-12-14	Code
56	ToRA-Code 13B (w/ code)	48.1	Yes	ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving	2023-09-29	Code
57	Minerva 8B (maj5@256)	47.6	No	Solving Quantitative Reasoning Problems with Language Models	2022-06-29	Code
58	Mistral-7B-KPMath-Plus	46.8	Yes	Key-Point-Driven Data Synthesis with its Enhancement on Mathematical Reasoning	2024-03-04	-
59	DART-Math-Llama3-8B-Prop2Diff (0-shot CoT, w/o code)	46.6	Yes	DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving	2024-06-18	Code
60	OpenMath-Llama2-70B (w/ code)	46.3	Yes	OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset	2024-02-15	Code
61	OpenMath-CodeLlama-13B (w/ code)	45.5	Yes	OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset	2024-02-15	Code
62	DART-Math-Mistral-7B-Prop2Diff (0-shot CoT, w/o code)	45.5	Yes	DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving	2024-06-18	Code
63	DART-Math-Llama3-8B-Uniform (0-shot CoT, w/o code)	45.3	Yes	DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving	2024-06-18	Code
64	MathCoder-CL-34B	45.2	Yes	MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning	2023-10-05	Code
65	MathCoder-L-34B	45.1	Yes	MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning	2023-10-05	Code
66	MMIQC-72B	45	Yes	Augmenting Math Word Problems via Iterative Question Composing	2024-01-17	Code
67	ToRA-Code 7B (w/ code)	44.6	Yes	ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving	2023-09-29	Code
68	OpenMath-Mistral-7B (w/ code)	44.5	Yes	OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset	2024-02-15	Code
69	MMOS-CODE-7B(0-shot)	44.3	Yes	An Empirical Study of Data Ability Boundary in LLMs' Math Reasoning	2024-02-23	Code
70	OpenMath-CodeLlama-7B (w/ code)	43.6	Yes	OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset	2024-02-15	Code
71	Shepherd+Mistral-7B (SFT on MetaMATH + PRM RL+ PRM rerank, k=256)	43.5	Yes	Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations	2023-12-14	Code
72	DART-Math-Mistral-7B-Uniform (0-shot CoT, w/o code)	43.5	Yes	DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving	2024-06-18	Code
73	Minerva 62B (maj1@k, k=64)	43.4	No	Solving Quantitative Reasoning Problems with Language Models	2022-06-29	Code
74	ToRA 13B (w/ code)	43	Yes	ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving	2023-09-29	Code
75	GPT-4	42.5	No	Sparks of Artificial General Intelligence: Early experiments with GPT-4	2023-03-22	Code
76	SFT-Mistral-7B	41.8	Yes	-	-	-
77	Llama2-13B-KPMath-Plus	41	No	Key-Point-Driven Data Synthesis with its Enhancement on Mathematical Reasoning	2024-03-04	-
78	ToRA 7B (w/ code)	40.1	Yes	ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving	2023-09-29	Code
79	MathCoder-CL-13B	35.9	Yes	MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning	2023-10-05	Code
80	MuggleMATH-70B	35.6	Yes	MuggleMath: Assessing the Impact of Query and Response Augmentation on Math Reasoning	2023-10-09	Code
81	PaLM 2 (few-shot, k=4, CoT)	34.3	No	PaLM 2 Technical Report	2023-05-17	Code
82	Minerva 540B	33.6	No	Solving Quantitative Reasoning Problems with Language Models	2022-06-29	Code
83	Minerva 540B (5-shot) mCoT	33.6	No	Galactica: A Large Language Model for Science	2022-11-16	Code
84	Shepherd + Mistral-7B (SFT on MetaMATH + PRM RL)	33	Yes	Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations	2023-12-14	Code
85	WizardMath-7B-V1.1	33	Yes	WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct	2023-08-18	Code
86	Gemini Pro (4-shot)	32.6	No	Gemini: A Family of Highly Capable Multimodal Models	2023-12-19	Code
87	MuggleMATH-13B	30.7	Yes	MuggleMath: Assessing the Impact of Query and Response Augmentation on Math Reasoning	2023-10-09	Code
88	MathCoder-CL-7B	30.2	Yes	MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning	2023-10-05	Code
89	MathCoder-L-13B	29.9	Yes	MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning	2023-10-05	Code
90	Qwen2idae-16x14B (4-shot)	29.9	No	Parameter-Efficient Sparsity Crafting from Dense to Mixture-of-Experts for Instruction Tuning on General Tasks	2024-01-05	Code
91	OpenChat-3.5-1210 7B	28.9	No	OpenChat: Advancing Open-source Language Models with Mixed-Quality Data	2023-09-20	Code
92	OpenChat-3.5 7B	28.6	No	OpenChat: Advancing Open-source Language Models with Mixed-Quality Data	2023-09-20	Code
93	Mixtral 8x7B (maj@4)	28.4	No	Mixtral of Experts	2024-01-08	Code
94	Minerva 62B (4-shot)	27.6	No	Solving Quantitative Reasoning Problems with Language Models	2022-06-29	Code
95	MetaMath 70B	26	Yes	MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models	2023-09-21	Code
96	MuggleMATH 7B	25.8	Yes	MuggleMath: Assessing the Impact of Query and Response Augmentation on Math Reasoning	2023-10-09	Code
97	Minerva 8B (maj1@k, k=64)	25.4	No	Solving Quantitative Reasoning Problems with Language Models	2022-06-29	Code
98	MathCoder-L-7B	23.3	Yes	MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning	2023-10-05	Code
99	WizardMath-70B-V1.0	22.7	Yes	WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct	2023-08-18	Code
100	Camelidae-8×34B (4-shot)	22.6	No	Parameter-Efficient Sparsity Crafting from Dense to Mixture-of-Experts for Instruction Tuning on General Tasks	2024-01-05	Code
101	MetaMath 13B	22.5	Yes	MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models	2023-09-21	Code
102	LLaMA 65B (maj1@k)	20.5	No	LLaMA: Open and Efficient Foundation Language Models	2023-02-27	Code
103	GAL 120B (5-shot) mCoT	20.4	No	Galactica: A Large Language Model for Science	2022-11-16	Code
104	MetaMath 7B	19.4	Yes	MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models	2023-09-21	Code
105	davinci-002 175B	19.1	No	Solving Quantitative Reasoning Problems with Language Models	2022-06-29	Code
106	Branch-Train-MiX 4x7B (sampling top-2 experts)	17.8	No	Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM	2024-03-12	Code
107	GAL 120B <work>	16.6	No	Galactica: A Large Language Model for Science	2022-11-16	Code
108	LLaMA 33B-maj1@k	15.2	No	LLaMA: Open and Efficient Foundation Language Models	2023-02-27	Code
109	Minerva 8B	14.1	No	Solving Quantitative Reasoning Problems with Language Models	2022-06-29	Code
110	WizardMath-13B-V1.0	14	Yes	WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct	2023-08-18	Code
111	Mistral 7B (maj@4)	13.1	No	Mistral 7B	2023-10-10	Code
112	GAL 30B (5-shot) mCoT	12.7	No	Galactica: A Large Language Model for Science	2022-11-16	Code
113	Mistral 7B (maj@4)	12.7	No	Mixtral of Experts	2024-01-08	Code
114	GAL 30B <work>	11.4	No	Galactica: A Large Language Model for Science	2022-11-16	Code
115	WizardMath-7B-V1.0	10.7	Yes	WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct	2023-08-18	Code
116	LLaMA 65B	10.6	No	LLaMA: Open and Efficient Foundation Language Models	2023-02-27	Code
117	PaLM 540B	8.8	No	Solving Quantitative Reasoning Problems with Language Models	2022-06-29	Code
118	PaLM 540B (5-shot) mCoT	8.8	No	Galactica: A Large Language Model for Science	2022-11-16	Code
119	LLaMA 13B-maj1@k	8.8	No	LLaMA: Open and Efficient Foundation Language Models	2023-02-27	Code
120	LLaMA 33B	7.1	No	LLaMA: Open and Efficient Foundation Language Models	2023-02-27	Code
121	LLaMA 7B-maj1@k	6.9	No	LLaMA: Open and Efficient Foundation Language Models	2023-02-27	Code
122	GPT-2 (1.5B)	6.9	No	Measuring Mathematical Problem Solving With the MATH Dataset	2021-03-05	Code
123	GPT-2 (0.7B)	6.4	No	Measuring Mathematical Problem Solving With the MATH Dataset	2021-03-05	Code
124	GPT-2 (0.3B)	6.2	No	Measuring Mathematical Problem Solving With the MATH Dataset	2021-03-05	Code
125	GPT-3 13B	5.6	No	Measuring Mathematical Problem Solving With the MATH Dataset	2021-03-05	Code
126	PaLM 8B (fine-tuned)	5.6	No	Solving Quantitative Reasoning Problems with Language Models	2022-06-29	Code
127	GPT-2 (0.1B)	5.4	No	Measuring Mathematical Problem Solving With the MATH Dataset	2021-03-05	Code
128	GPT-3-175B (few-shot)	5.2	No	Measuring Mathematical Problem Solving With the MATH Dataset	2021-03-05	Code
129	GPT-3 175B (8-shot)	5.2	No	Galactica: A Large Language Model for Science	2022-11-16	Code
130	PaLM 62B	4.4	No	Solving Quantitative Reasoning Problems with Language Models	2022-06-29	Code
131	LLaMA 13B	3.9	No	LLaMA: Open and Efficient Foundation Language Models	2023-02-27	Code
132	GPT-3-13B (few-shot)	3	No	Measuring Mathematical Problem Solving With the MATH Dataset	2021-03-05	Code
133	LLaMA 7B	2.9	No	LLaMA: Open and Efficient Foundation Language Models	2023-02-27	Code
134	GPT-3 2.7B	2.9	No	Measuring Mathematical Problem Solving With the MATH Dataset	2021-03-05	Code
135	PaLM 8B	1.5	No	Solving Quantitative Reasoning Problems with Language Models	2022-06-29	Code

Entries flagged SOTA on the leaderboard: Qwen2.5-Math-72B-Instruct(TIR,Greedy); GPT-4 Turbo (MACM, w/code, voting); GPT-4-code model (CSV, w/ code, SC, k=16); CR (GPT-4-turbo model, w/ code); Minerva 62B (maj5@256); GPT-2 (1.5B).
