Achieving >97% on GSM8K: Deeply Understanding the Problems Makes LLMs Better Solvers for Math Word Problems

Qihuang Zhong, Kang Wang, Ziyang Xu, Juhua Liu, Liang Ding, Bo Du

2024-04-23

Math, Math Word Problem Solving, GSM8K, Arithmetic Reasoning

Abstract

Chain-of-Thought (CoT) prompting has enhanced the performance of Large Language Models (LLMs) across various reasoning tasks. However, CoT still falls short on complex math word problems, where it usually suffers from three pitfalls: semantic misunderstanding errors, calculation errors, and step-missing errors. Prior studies address the calculation errors and step-missing errors but neglect the semantic misunderstanding errors, which are the major factor limiting the reasoning performance of LLMs. To this end, we propose a simple-yet-effective method, namely Deeply Understanding the Problems (DUP), to improve the LLMs' math problem-solving ability by addressing semantic misunderstanding errors. The core of our method is to encourage the LLMs to deeply understand the problems and extract the key problem-solving information needed for better reasoning. Extensive experiments on 10 diverse reasoning benchmarks show that our DUP method consistently outperforms its counterparts by a large margin. More encouragingly, DUP achieves a new SOTA result on the GSM8K benchmark, with an accuracy of 97.1% under the zero-shot setting.
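
As a rough illustration of the "understand first, then reason" idea described in the abstract, the following is a minimal sketch, not the authors' implementation. The split into three zero-shot calls, the prompt wording, and the names `LLM`, `dup_solve`, and `make_stub_llm` are all assumptions made for this example; `llm` stands for any function that maps a prompt string to a completion.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: any function mapping a prompt string to a completion.
LLM = Callable[[str], str]

# Illustrative prompt wording (an assumption, not quoted from the paper).
CORE_PROMPT = "Problem: {problem}\nState the core question this problem asks."
INFO_PROMPT = (
    "Problem: {problem}\nCore question: {core}\n"
    "List only the information needed to answer the core question."
)
SOLVE_PROMPT = (
    "Problem: {problem}\nCore question: {core}\nKey information: {info}\n"
    "Solve step by step and give the final numeric answer."
)


def dup_solve(problem: str, llm: LLM) -> Dict[str, str]:
    """Extract the question and key facts first, then ask for the solution."""
    core = llm(CORE_PROMPT.format(problem=problem)).strip()
    info = llm(INFO_PROMPT.format(problem=problem, core=core)).strip()
    answer = llm(
        SOLVE_PROMPT.format(problem=problem, core=core, info=info)
    ).strip()
    return {"core_question": core, "key_information": info, "answer": answer}


def make_stub_llm(replies: List[str]) -> Tuple[LLM, List[str]]:
    """Stand-in model for a dry run: returns canned replies, records prompts."""
    prompts: List[str] = []
    queue = list(replies)

    def llm(prompt: str) -> str:
        prompts.append(prompt)
        return queue.pop(0)

    return llm, prompts
```

The stub makes the data flow visible without calling a real model: the third prompt contains both the extracted question and the extracted information, so the final reasoning step is conditioned on the model's own reading of the problem rather than on the raw text alone.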

Results

Task	Dataset	Metric	Value	Model
Question Answering	SVAMP	Accuracy	94.2	GPT-4 DUP
Math Word Problem Solving	SVAMP	Accuracy	94.2	GPT-4 DUP
Mathematical Question Answering	SVAMP	Accuracy	94.2	GPT-4 DUP
Mathematical Reasoning	SVAMP	Accuracy	94.2	GPT-4 DUP
Arithmetic Reasoning	GSM8K	Accuracy	97.1	DUP prompt upon GPT-4
