Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Achieving >97% on GSM8K: Deeply Understanding the Problems Makes LLMs Better Solvers for Math Word Problems

Qihuang Zhong, Kang Wang, Ziyang Xu, Juhua Liu, Liang Ding, Bo Du

2024-04-23 | Math | Math Word Problem Solving | GSM8K | Arithmetic Reasoning

Abstract

Chain-of-Thought (CoT) prompting has enhanced the performance of Large Language Models (LLMs) across various reasoning tasks. However, CoT still falls short on complex math word problems, where it typically suffers from three pitfalls: semantic misunderstanding errors, calculation errors, and step-missing errors. Prior studies address calculation errors and step-missing errors but neglect semantic misunderstanding errors, which are the major factor limiting the reasoning performance of LLMs. To this end, we propose a simple yet effective method, Deeply Understanding the Problems (DUP), which improves LLMs' math problem-solving ability by addressing semantic misunderstanding errors. The core of our method is to encourage the LLM to deeply understand the problem and extract the key problem-solving information used for better reasoning. Extensive experiments on 10 diverse reasoning benchmarks show that DUP consistently outperforms the other counterparts by a large margin. More encouragingly, DUP achieves a new SOTA result on the GSM8K benchmark, with an accuracy of 97.1% under the zero-shot setting.
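
The abstract describes the method only at a high level: have the model understand the problem, extract the key problem-solving information, and then reason with it. A minimal Python sketch of that flow might look like the following. The three-call split, the prompt wording, the function name `dup_style_solve`, and the `llm` callable are all assumptions made for illustration; they are not the authors' prompts or code.

```python
from typing import Callable


def dup_style_solve(problem: str, llm: Callable[[str], str]) -> str:
    """Understand-then-solve prompting sketch.

    `llm` is a hypothetical callable that maps a prompt string to a
    completion string. All prompt wording below is illustrative only.
    """
    # Step 1 (assumed): ask the model to restate what is actually being asked.
    core_question = llm(
        f"{problem}\n\n"
        "State the core question this problem asks, in one sentence."
    ).strip()

    # Step 2 (assumed): extract only the information needed for that question.
    key_info = llm(
        f"{problem}\n\n"
        f"Core question: {core_question}\n"
        "List only the facts and quantities needed to answer the core question."
    ).strip()

    # Step 3 (assumed): solve with the extracted understanding as a hint.
    return llm(
        f"{problem}\n\n"
        f"Core question: {core_question}\n"
        f"Key information: {key_info}\n"
        "Using the key information, solve the problem step by step "
        "and give the final answer."
    ).strip()
```

Because `llm` is passed in as a plain callable, the sketch can be exercised with a stub that returns canned strings before it is wired to a real model client.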

Results

Task                             Dataset  Metric    Value (%)  Model
Question Answering               SVAMP    Accuracy  94.2       GPT-4 DUP
Math Word Problem Solving        SVAMP    Accuracy  94.2       GPT-4 DUP
Mathematical Question Answering  SVAMP    Accuracy  94.2       GPT-4 DUP
Mathematical Reasoning           SVAMP    Accuracy  94.2       GPT-4 DUP
Arithmetic Reasoning             GSM8K    Accuracy  97.1       DUP prompt upon GPT-4

Related Papers

VAR-MATH: Probing True Mathematical Reasoning in Large Language Models via Symbolic Multi-Instance Benchmarks (2025-07-17)
QuestA: Expanding Reasoning Capacity in LLMs via Question Augmentation (2025-07-17)
GEMMAS: Graph-based Evaluation Metrics for Multi Agent Systems (2025-07-17)
Scaling Up RL: Unlocking Diverse Reasoning in LLMs via Prolonged Training (2025-07-16)
DAC: A Dynamic Attention-aware Approach for Task-Agnostic Prompt Compression (2025-07-16)
Temperature and Persona Shape LLM Agent Consensus With Minimal Accuracy Gains in Qualitative Coding (2025-07-15)
Personalized Exercise Recommendation with Semantically-Grounded Knowledge Tracing (2025-07-15)
KisMATH: Do LLMs Have Knowledge of Implicit Structures in Mathematical Reasoning? (2025-07-15)