Arithmetic Reasoning on GSM8K

Metric: Accuracy (higher is better)

Results

#	Model	Accuracy (%)	Extra Data	Paper	Date	Code
1	Claude 3.5 Sonnet (HPT)	97.72	No	Hierarchical Prompting Taxonomy: A Universal Eva...	2024-06-18	Code
2	DUP prompt upon GPT-4	97.1	No	Achieving >97% on GSM8K: Deeply Understanding th...	2024-04-23	Code
3	Qwen2-Math-72B-Instruct (greedy)	96.7	Yes	Qwen2 Technical Report	2024-07-15	Code
4	SFT-Mistral-7B (Metamath, OVM, Smart Ensemble)	96.4	Yes	-	-	-
5	OpenMath2-Llama3.1-70B (majority@256)	96	Yes	OpenMathInstruct-2: Accelerating AI for Math wit...	2024-10-02	Code
6	Jiutian-大模型	95.2	No	-	-	-
7	DAMOMath-7B(MetaMath, OVM, BS, Ensemble)	95.1	Yes	-	-	-
8	Claude 3 Opus (0-shot chain-of-thought)	95	No	-	-	-
9	OpenMath2-Llama3.1-70B	94.9	Yes	OpenMathInstruct-2: Accelerating AI for Math wit...	2024-10-02	Code
10	GPT-4 (Teaching-Inspired)	94.8	No	Teaching-Inspired Integrated Prompting Framework...	2024-10-10	Code
11	SFT-Mistral-7B (Metamath + OVM + ensemble)	94.13	Yes	-	-	-
12	OpenMath2-Llama3.1-8B (majority@256)	94.1	Yes	OpenMathInstruct-2: Accelerating AI for Math wit...	2024-10-02	Code
13	Qwen2-72B-Instruct-Step-DPO (0-shot CoT)	94	Yes	Step-DPO: Step-wise Preference Optimization for ...	2024-06-26	Code
14	DAMOMath-7B(MetaMath, OVM, Ensemble)	93.2	Yes	-	-	-
15	Claude 3 Sonnet (0-shot chain-of-thought)	92.3	No	-	-	-
16	AlphaLLM (with MCTS)	92	No	Toward Self-Improvement of LLMs via Imagination,...	2024-04-18	Code
17	OpenMath2-Llama3.1-8B	91.7	Yes	OpenMathInstruct-2: Accelerating AI for Math wit...	2024-10-02	Code
18	PaLM 2 (few-shot, k=8, SC)	91	No	PaLM 2 Technical Report	2023-05-17	Code
19	GaC(Qwen2-72B-Instruct + Llama-3-70B-Instruct)	90.91	No	Breaking the Ceiling of the LLM Community by Tre...	2024-06-18	Code
20	OpenMath-CodeLlama-70B (w/ code, SC, k=50)	90.8	Yes	OpenMathInstruct-1: A 1.8 Million Math Instructi...	2024-02-15	Code
21	DART-Math-Llama3-70B-Uniform (0-shot CoT, w/o code)	90.4	Yes	DART-Math: Difficulty-Aware Rejection Tuning for...	2024-06-18	Code
22	OpenMath-Llama2-70B (w/ code, SC, k=50)	90.1	Yes	OpenMathInstruct-1: A 1.8 Million Math Instructi...	2024-02-15	Code
23	DART-Math-Llama3-70B-Prop2Diff (0-shot CoT, w/o code)	89.6	Yes	DART-Math: Difficulty-Aware Rejection Tuning for...	2024-06-18	Code
24	Shepherd+Mistral-7B (SFT on MetaMATH + PRM RL+ PRM rerank, k=256)	89.1	Yes	Math-Shepherd: Verify and Reinforce LLMs Step-by...	2023-12-14	Code
25	Llama SFT (Metamath ToRA Ensemble)	89	Yes	-	-	-
26	Minerva 62B (maj5@100)	89	No	Solving Quantitative Reasoning Problems with Lan...	2022-06-29	Code
27	Claude 3 Haiku (0-shot chain-of-thought)	88.9	No	-	-	-
28	ToRA-70B (SC, k=50)	88.3	Yes	ToRA: A Tool-Integrated Reasoning Agent for Math...	2023-09-29	Code
29	DeepSeekMATH-RL-7B	88.2	Yes	DeepSeekMath: Pushing the Limits of Mathematical...	2024-02-05	Code
30	DART-Math-DSMath-7B-Uniform (0-shot CoT, w/o code)	88.2	Yes	DART-Math: Difficulty-Aware Rejection Tuning for...	2024-06-18	Code
31	OpenMath-CodeLlama-34B (w/ code, SC, k=50)	88	Yes	OpenMathInstruct-1: A 1.8 Million Math Instructi...	2024-02-15	Code
32	Claude 2 (0-shot chain-of-thought)	88	No	-	-	-
33	Shivaay-4B (8-shot chain-of-thought)	87.41	No	-	-	-
34	DeepMind 70B Model (SFT+ORM-RL, ORM reranking)	87.3	Yes	Solving math word problems with process- and out...	2022-11-25	-
35	MMOS-DeepSeekMath-7B(0-shot,k=50)	87.2	Yes	An Empirical Study of Data Ability Boundary in L...	2024-02-23	Code
36	DeepMind 70B Model (SFT+PRM-RL, PRM reranking)	87.1	Yes	Solving math word problems with process- and out...	2022-11-25	-
37	GPT-4	87.1	No	Sparks of Artificial General Intelligence: Early...	2023-03-22	Code
38	OpenMath-Mistral-7B (w/ code, SC, k=50)	86.9	Yes	OpenMathInstruct-1: A 1.8 Million Math Instructi...	2024-02-15	Code
39	Orca-Math 7B (fine-tuned)	86.8	Yes	Orca-Math: Unlocking the potential of SLMs in Gr...	2024-02-16	-
40	DART-Math-DSMath-7B-Prop2Diff (0-shot CoT, w/o code)	86.8	Yes	DART-Math: Difficulty-Aware Rejection Tuning for...	2024-06-18	Code
41	OpenMath-CodeLlama-13B (w/ code, SC, k=50)	86.8	Yes	OpenMathInstruct-1: A 1.8 Million Math Instructi...	2024-02-15	Code
42	Gemini Pro (maj1@32)	86.5	No	Gemini: A Family of Highly Capable Multimodal Mo...	2023-12-19	Code
43	Codex (Self-Evaluation Guided Decoding, PAL, multiple reasoning chains, 9-shot gen, 5-shot eval)	85.5	No	-	-	-
44	Claude 1.3 (0-shot chain-of-thought)	85.2	No	-	-	-
45	ToRA-Code-34B (SC, k=50)	85.1	Yes	ToRA: A Tool-Integrated Reasoning Agent for Math...	2023-09-29	Code
46	OpenMath-CodeLlama-7B (w/ code, SC, k=50)	84.8	Yes	OpenMathInstruct-1: A 1.8 Million Math Instructi...	2024-02-15	Code
47	OVM-Mistral-7B (verify100@1)	84.7	No	OVM, Outcome-supervised Value Models for Plannin...	2023-11-16	Code
48	OpenMath-Llama2-70B (w/ code)	84.7	Yes	OpenMathInstruct-1: A 1.8 Million Math Instructi...	2024-02-15	Code
49	OpenMath-CodeLlama-70B (w/ code)	84.6	Yes	OpenMathInstruct-1: A 1.8 Million Math Instructi...	2024-02-15	Code
50	code-davinci-002 175B (LEVER, 8-shot)	84.5	No	LEVER: Learning to Verify Language-to-Code Gener...	2023-02-16	Code
51	ToRA 70B	84.3	Yes	ToRA: A Tool-Integrated Reasoning Agent for Math...	2023-09-29	Code
52	Shepherd + Mistral-7B (SFT on MetaMATH + PRM RL)	84.1	Yes	Math-Shepherd: Verify and Reinforce LLMs Step-by...	2023-12-14	Code
53	MathCoder-L-70B	83.9	Yes	MathCoder: Seamless Code Integration in LLMs for...	2023-10-05	Code
54	WizardMath-7B-V1.1	83.2	Yes	WizardMath: Empowering Mathematical Reasoning fo...	2023-08-18	Code
55	DIVERSE 175B (8-shot)	83.2	No	Making Large Language Models Better Reasoners wi...	2022-06-06	-
56	OVM-Mistral-7B (verify20@1)	82.6	No	OVM, Outcome-supervised Value Models for Plannin...	2023-11-16	Code
57	DART-Math-Mistral-7B-Uniform (0-shot CoT, w/o code)	82.6	Yes	DART-Math: Difficulty-Aware Rejection Tuning for...	2024-06-18	Code
58	ChatGPT (Ask, Refine, Trust)	82.6	No	The ART of LLM Refinement: Ask, Refine, and Trust	2023-11-14	-
59	DART-Math-Llama3-8B-Uniform (0-shot CoT, w/o code)	82.5	Yes	DART-Math: Difficulty-Aware Rejection Tuning for...	2024-06-18	Code
60	MetaMath 70B	82.3	Yes	MetaMath: Bootstrap Your Own Mathematical Questi...	2023-09-21	Code
61	MuggleMATH 70B	82.3	Yes	MuggleMath: Assessing the Impact of Query and Re...	2023-10-09	Code
62	PaLM 540B (Self Improvement, Self Consistency)	82.1	No	Large Language Models Can Self-Improve	2022-10-20	-
63	MathCoder-CL-34B	81.7	Yes	MathCoder: Seamless Code Integration in LLMs for...	2023-10-05	Code
64	WizardMath-70B-V1.0	81.6	Yes	WizardMath: Empowering Mathematical Reasoning fo...	2023-08-18	Code
65	Phi-GSM+V 1.3B+1.3B (verify48@1)	81.5	No	TinyGSM: achieving >80% on GSM8k with small lang...	2023-12-14	-
66	DART-Math-Mistral-7B-Prop2Diff (0-shot CoT, w/o code)	81.1	Yes	DART-Math: Difficulty-Aware Rejection Tuning for...	2024-06-18	Code
67	DART-Math-Llama3-8B-Prop2Diff (0-shot CoT, w/o code)	81.1	Yes	DART-Math: Difficulty-Aware Rejection Tuning for...	2024-06-18	Code
68	Claude Instant 1.1 (0-shot chain-of-thought)	80.9	No	-	-	-
69	ToRA-Code 34B	80.7	Yes	ToRA: A Tool-Integrated Reasoning Agent for Math...	2023-09-29	Code
70	OpenMath-CodeLlama-34B (w/ code)	80.7	Yes	OpenMathInstruct-1: A 1.8 Million Math Instructi...	2024-02-15	Code
71	PaLM 2 (few-shot, k=8, CoT)	80.7	No	PaLM 2 Technical Report	2023-05-17	Code
72	MMOS-DeepSeekMath-7B(0-shot)	80.5	Yes	An Empirical Study of Data Ability Boundary in L...	2024-02-23	Code
73	MMOS-CODE-34B(0-shot)	80.4	Yes	An Empirical Study of Data Ability Boundary in L...	2024-02-23	Code
74	OpenMath-Mistral-7B (w/ code)	80.2	Yes	OpenMathInstruct-1: A 1.8 Million Math Instructi...	2024-02-15	Code
75	Self-Evaluation Guided Decoding (Codex, PAL, single reasoning chain, 9-shot gen, 5-shot eval)	80.2	No	-	-	-
76	OpenMath-CodeLlama-13B (w/ code)	78.8	Yes	OpenMathInstruct-1: A 1.8 Million Math Instructi...	2024-02-15	Code
77	Minerva 540B (CoT)	78.5	No	Solving Quantitative Reasoning Problems with Lan...	2022-06-29	Code
78	Camelidae-8×34B (5-shot)	78.3	No	Parameter-Efficient Sparsity Crafting from Dense...	2024-01-05	Code
79	Qwen2idae-16x14B (5-shot)	77.8	No	Parameter-Efficient Sparsity Crafting from Dense...	2024-01-05	Code
80	MetaMath-Mistral-7B	77.7	Yes	MetaMath: Bootstrap Your Own Mathematical Questi...	2023-09-21	Code
81	OpenChat-3.5 7B	77.3	No	OpenChat: Advancing Open-source Language Models ...	2023-09-20	Code
82	DeepMind 70B Model (STaR, maj1@96)	76.5	Yes	Solving math word problems with process- and out...	2022-11-25	-
83	Arithmo2-Mistral-7B	76.4	No	-	-	-
84	OpenMath-CodeLlama-7B (w/ code)	75.9	Yes	OpenMathInstruct-1: A 1.8 Million Math Instructi...	2024-02-15	Code
85	ToRA-Code 13B	75.8	Yes	ToRA: A Tool-Integrated Reasoning Agent for Math...	2023-09-29	Code
86	Arithmo-Mistral-7B	74.7	No	-	-	-
87	PaLM 540B maj1@40 (8-shot)	74.4	Yes	Self-Consistency Improves Chain of Thought Reaso...	2022-03-21	Code
88	PaLM 540B (Self Consistency)	74.4	No	Large Language Models Can Self-Improve	2022-10-20	-
89	Phi-GSM 2.7B (fine-tuned)	74.3	No	TinyGSM: achieving >80% on GSM8k with small lang...	2023-12-14	-
90	MathCoder-CL-13B	74.1	Yes	MathCoder: Seamless Code Integration in LLMs for...	2023-10-05	Code
91	MuggleMATH 13B	74	Yes	MuggleMath: Assessing the Impact of Query and Re...	2023-10-09	Code
92	MMOS-CODE-7B(0-shot)	73.9	Yes	An Empirical Study of Data Ability Boundary in L...	2024-02-23	Code
93	CodeT5+	73.8	No	CodeT5+: Open Code Large Language Models for Cod...	2023-05-13	Code
94	Llama-3.3-70B + CAPO	73.73	No	CAPO: Cost-Aware Prompt Optimization	2025-04-22	Code
95	OVM-Llama2-7B (verify100@1)	73.7	No	OVM, Outcome-supervised Value Models for Plannin...	2023-11-16	Code
96	PaLM 540B (Self Improvement, CoT Prompting)	73.5	No	Large Language Models Can Self-Improve	2022-10-20	-
97	KwaiYiiMath 13B	73.3	Yes	KwaiYiiMath: Technical Report	2023-10-11	-
98	ToRA-Code 7B	72.6	Yes	ToRA: A Tool-Integrated Reasoning Agent for Math...	2023-09-29	Code
99	MathCoder-L-13B	72.6	Yes	MathCoder: Seamless Code Integration in LLMs for...	2023-10-05	Code
100	DBRX Base 132B	72.3	No	-	-	-
101	Self-Evaluation Guided Decoding (Codex, CoT, single reasoning chain, 9-shot gen, 5-shot eval)	71.9	No	-	-	-
102	MetaMath 13B	71	Yes	MetaMath: Bootstrap Your Own Mathematical Questi...	2023-09-21	Code
103	MuggleMATH 7B	69.8	Yes	MuggleMath: Assessing the Impact of Query and Re...	2023-10-09	Code
104	LLaMA 65B-maj1@k	69.7	No	LLaMA: Open and Efficient Foundation Language Mo...	2023-02-27	Code
105	Minerva 62B (maj1@100)	68.5	Yes	Solving Quantitative Reasoning Problems with Lan...	2022-06-29	Code
106	code-davinci-002 (Least-to-Most Prompting)	68.01	No	Least-to-Most Prompting Enables Complex Reasonin...	2022-05-21	Code
107	MathCoder-CL-7B	67.8	Yes	MathCoder: Seamless Code Integration in LLMs for...	2023-10-05	Code
108	DBRX Instruct 132B	66.9	No	-	-	-
109	MetaMath 7B	66.4	Yes	MetaMath: Bootstrap Your Own Mathematical Questi...	2023-09-21	Code
110	Mistral-Small-24B + CAPO	65.07	No	CAPO: Cost-Aware Prompt Optimization	2025-04-22	Code
111	RFT 70B	64.8	Yes	Scaling Relationship on Learning Mathematical Re...	2023-08-03	Code
112	MathCoder-L-7B	64.2	Yes	MathCoder: Seamless Code Integration in LLMs for...	2023-10-05	Code
113	WizardMath-13B-V1.0	63.9	Yes	WizardMath: Empowering Mathematical Reasoning fo...	2023-08-18	Code
114	GPT-J (CoRe)	63.2	No	Solving Math Word Problems via Cooperative Reaso...	2022-10-28	Code
115	Llama-2 70B (on 100 first questions, 4-shot, auto-optimized prompting)	61	No	The Unreasonable Effectiveness of Eccentric Auto...	2024-02-09	-
116	Qwen2.5-32B + CAPO	60.2	No	CAPO: Cost-Aware Prompt Optimization	2025-04-22	Code
117	LLaMA 2 70B (CoT-Influx)	59.59	No	Fewer is More: Boosting LLM Reasoning with Reinf...	2023-12-14	-
118	Orca 2 13B	59.14	No	Orca 2: Teaching Small Language Models How to Re...	2023-11-18	-
119	U-PaLM	58.5	No	Transcending Scaling Laws with 0.1% Extra Compute	2022-10-20	-
120	PaLM-540B (few-Shot-cot)	58.1	Yes	Large Language Models are Zero-Shot Reasoners	2022-05-24	Code
121	GPT-3.5 (few-shot, k=5)	57.1	No	GPT-4 Technical Report	2023-03-15	Code
122	Minerva 8B (maj5@100)	56.8	No	Solving Quantitative Reasoning Problems with Lan...	2022-06-29	Code
123	LLaMA 2 70B (on-shot)	56.8	No	Llama 2: Open Foundation and Fine-Tuned Chat Mod...	2023-07-18	Code
124	PaLM 540B (8-shot)	56.5	Yes	Solving Quantitative Reasoning Problems with Lan...	2022-06-29	Code
125	PaLM 540B (CoT Prompting)	56.5	No	Large Language Models Can Self-Improve	2022-10-20	-
126	RFT 13B	55.3	Yes	Scaling Relationship on Learning Mathematical Re...	2023-08-03	Code
127	Finetuned GPT-3 175B + verifier	55	Yes	Large Language Models are Zero-Shot Reasoners	2022-05-24	Code
128	WizardMath-7B-V1.0	54.9	Yes	WizardMath: Empowering Mathematical Reasoning fo...	2023-08-18	Code
129	LLaMA 33B-maj1@k	53.1	No	LLaMA: Open and Efficient Foundation Language Mo...	2023-02-27	Code
130	Minerva 62B (8-shot)	52.4	Yes	Solving Quantitative Reasoning Problems with Lan...	2022-06-29	Code
131	Mistral 7B (maj@8)	52.2	No	Mistral 7B	2023-10-10	Code
132	Llemma 34B	51.5	No	Llemma: An Open Language Model For Mathematics	2023-10-16	Code
133	Text-davinci-002-175B (zero-plus-few-Shot-cot (8 samples))	51.5	Yes	Large Language Models are Zero-Shot Reasoners	2022-05-24	Code
134	RFT 7B	51.2	Yes	Scaling Relationship on Learning Mathematical Re...	2023-08-03	Code
135	LLaMA 65B	50.9	No	LLaMA: Open and Efficient Foundation Language Mo...	2023-02-27	Code
136	Orca 2 7B	47.23	No	Orca 2: Teaching Small Language Models How to Re...	2023-11-18	-
137	Llama-2 13B (on 100 first questions, 4-shot, auto-optimized prompting)	43	No	The Unreasonable Effectiveness of Eccentric Auto...	2024-02-09	-
138	text-davinci-002 175B (2-shot, CoT)	41.3	Yes	Large Language Models are Zero-Shot Reasoners	2022-05-24	Code
139	Mistral 7B (on 100 first questions, 4-shot, auto-optimized prompting)	41	No	The Unreasonable Effectiveness of Eccentric Auto...	2024-02-09	-
140	text-davinci-002 175B (0-shot, CoT)	40.7	Yes	Large Language Models are Zero-Shot Reasoners	2022-05-24	Code
141	Branch-Train-MiX 4x7B (sampling top-2 experts)	37.1	No	Branch-Train-MiX: Mixing Expert LLMs into a Mixt...	2024-03-12	Code
142	Llemma 7B	36.4	No	Llemma: An Open Language Model For Mathematics	2023-10-16	Code
143	LLaMA 33B	35.6	No	LLaMA: Open and Efficient Foundation Language Mo...	2023-02-27	Code
144	Vicuna (SYRELM)	35.2	Yes	Frugal LMs Trained to Invoke Symbolic Solvers Ac...	2023-12-09	Code
145	PaLM 62B (8-shot)	33	Yes	Solving Quantitative Reasoning Problems with Lan...	2022-06-29	Code
146	PaLM 540B (Self Improvement, Standard-Prompting)	32.2	No	Large Language Models Can Self-Improve	2022-10-20	-
147	LLaMA 13B-maj1@k	29.3	No	LLaMA: Open and Efficient Foundation Language Mo...	2023-02-27	Code
148	Minerva 8B-maj1@k (8-shot)	28.4	Yes	Solving Quantitative Reasoning Problems with Lan...	2022-06-29	Code
149	GPT-2-Medium 355M + question-solution classifier (BS=5)	20.8	No	Composing Ensembles of Pre-trained Models via It...	2022-10-20	-
150	GPT-Neo-2.7B + Self-Sampling	19.5	No	Learning Math Reasoning from Self-Sampled Correc...	2022-05-28	Code
151	GPT-2-Medium 355M (fine-tuned, BS=5)	18.3	No	Composing Ensembles of Pre-trained Models via It...	2022-10-20	-
152	LLaMA 7B (maj1@k)	18.1	No	LLaMA: Open and Efficient Foundation Language Mo...	2023-02-27	Code
153	PaLM 540B (few-shot)	17.9	Yes	Large Language Models are Zero-Shot Reasoners	2022-05-24	Code
154	PaLM 540B (Standard-Prompting)	17.9	No	Large Language Models Can Self-Improve	2022-10-20	-
155	LLaMA 13B	17.8	No	LLaMA: Open and Efficient Foundation Language Mo...	2023-02-27	Code
156	GPT-2-Medium 355M + question-solution classifier (BS=1)	16.8	No	Composing Ensembles of Pre-trained Models via It...	2022-10-20	-
157	Minerva 8B (8-shot)	16.2	Yes	Solving Quantitative Reasoning Problems with Lan...	2022-06-29	Code
158	GPT-2-Medium 355M (BS=5)	12.2	No	Composing Ensembles of Pre-trained Models via It...	2022-10-20	-
159	LLaMA 7B	11	No	LLaMA: Open and Efficient Foundation Language Mo...	2023-02-27	Code
160	Text-davinci-002-175B (0-shot)	10.4	Yes	Large Language Models are Zero-Shot Reasoners	2022-05-24	Code
161	GPT-Neo 125M + Self-Sampling	7.5	No	Learning Math Reasoning from Self-Sampled Correc...	2022-05-28	Code
162	UL2 20B (chain-of-thought)	4.4	No	UL2: Unifying Language Learning Paradigms	2022-05-10	Code
163	PaLM 8B (8-shot)	4.1	Yes	Solving Quantitative Reasoning Problems with Lan...	2022-06-29	Code
164	UL2 20B (0-shot)	4.1	No	UL2: Unifying Language Learning Paradigms	2022-05-10	Code