Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Measuring Mathematical Problem Solving With the MATH Dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, Jacob Steinhardt

2021-03-05 | Mathematical Reasoning, Math, Math Word Problem Solving, Text Generation, Mathematical Problem-Solving

Abstract

Many intellectual endeavors require mathematical problem solving, but this skill remains beyond the capabilities of computers. To measure this ability in machine learning models, we introduce MATH, a new dataset of 12,500 challenging competition mathematics problems. Each problem in MATH has a full step-by-step solution which can be used to teach models to generate answer derivations and explanations. To facilitate future research and increase accuracy on MATH, we also contribute a large auxiliary pretraining dataset which helps teach models the fundamentals of mathematics. Even though we are able to increase accuracy on MATH, our results show that accuracy remains relatively low, even with enormous Transformer models. Moreover, we find that simply increasing budgets and model parameter counts will be impractical for achieving strong mathematical reasoning if scaling trends continue. While scaling Transformers is automatically solving most other text-based tasks, scaling is not currently solving MATH. To have more traction on mathematical problem solving we will likely need new algorithmic advancements from the broader research community.

Results

Dataset: MATH. The same results are listed under four tasks: Question Answering, Math Word Problem Solving, Mathematical Question Answering, and Mathematical Reasoning.

Model                    Parameters (Billions)    Accuracy (%)
GPT-2 (1.5B)             1.5                      6.9
GPT-2 (0.7B)             0.7                      6.4
GPT-2 (0.3B)             0.3                      6.2
GPT-3 13B                13                       5.6
GPT-2 (0.1B)             0.1                      5.4
GPT-3-175B (few-shot)    175                      5.2
GPT-3-13B (few-shot)     13                       3.0
GPT-3 2.7B               2.7                      2.9
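
The abstract's point about scaling can be illustrated with the four GPT-2 rows listed in the results above. The following is a rough sketch only, not the paper's own analysis: it assumes accuracy grows linearly in log10 of the parameter count and fits that line by ordinary least squares. The names `GPT2_RESULTS`, `fit_log_linear`, and `log10_params_needed` are hypothetical helpers introduced for this illustration, and a four-point fit extrapolated far outside its range should be read as an order-of-magnitude intuition at most.

```python
import math

# (parameters in billions, accuracy in %) for the four GPT-2 rows in the results.
GPT2_RESULTS = [(0.1, 5.4), (0.3, 6.2), (0.7, 6.4), (1.5, 6.9)]


def fit_log_linear(points):
    """Least-squares fit of accuracy = intercept + slope * log10(params_in_billions)."""
    xs = [math.log10(params) for params, _ in points]
    ys = [accuracy for _, accuracy in points]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept


def log10_params_needed(target_accuracy, slope, intercept):
    """log10 of the parameter count (in billions) at which the fitted line hits the target."""
    return (target_accuracy - intercept) / slope


slope, intercept = fit_log_linear(GPT2_RESULTS)
print(f"Accuracy gain per 10x parameters: {slope:.2f} percentage points")
print(f"log10(billions of parameters) for 50% accuracy: "
      f"{log10_params_needed(50.0, slope, intercept):.1f}")
```

Under this assumed trend, each tenfold increase in parameters adds only about one percentage point of accuracy, which is one way to see why the abstract calls pure scaling impractical for MATH.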

Related Papers

2025-07-17  VAR-MATH: Probing True Mathematical Reasoning in Large Language Models via Symbolic Multi-Instance Benchmarks
2025-07-17  QuestA: Expanding Reasoning Capacity in LLMs via Question Augmentation
2025-07-17  Making Language Model a Hierarchical Classifier and Generator
2025-07-16  A Survey of Deep Learning for Geometry Problem Solving
2025-07-16  Scaling Up RL: Unlocking Diverse Reasoning in LLMs via Prolonged Training
2025-07-16  Mitigating Object Hallucinations via Sentence-Level Early Intervention
2025-07-15  KisMATH: Do LLMs Have Knowledge of Implicit Structures in Mathematical Reasoning?
2025-07-15  Temperature and Persona Shape LLM Agent Consensus With Minimal Accuracy Gains in Qualitative Coding