Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Are NLP Models really able to Solve Simple Math Word Problems?

Arkil Patel, Satwik Bhattamishra, Navin Goyal

2021-03-12 | NAACL 2021 | Tasks: Math Word Problem Solving, Math

Abstract

The problem of designing NLP solvers for math word problems (MWP) has seen sustained research activity and steady gains in the test accuracy. Since existing solvers achieve high performance on the benchmark datasets for elementary level MWPs containing one-unknown arithmetic word problems, such problems are often considered "solved" with the bulk of research attention moving to more complex MWPs. In this paper, we restrict our attention to English MWPs taught in grades four and lower. We provide strong evidence that the existing MWP solvers rely on shallow heuristics to achieve high performance on the benchmark datasets. To this end, we show that MWP solvers that do not have access to the question asked in the MWP can still solve a large fraction of MWPs. Similarly, models that treat MWPs as bag-of-words can also achieve surprisingly high accuracy. Further, we introduce a challenge dataset, SVAMP, created by applying carefully chosen variations over examples sampled from existing datasets. The best accuracy achieved by state-of-the-art models is substantially lower on SVAMP, thus showing that much remains to be done even for the simplest of the MWPs.
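The two input ablations mentioned in the abstract (hiding the question from the solver, and treating the problem as a bag of words) can be illustrated with a minimal sketch. This is not the authors' code: the helper names `remove_question` and `bag_of_words`, the assumption that the question is the final sentence, and the example problem are all hypothetical and chosen only for illustration.

```python
import re
from collections import Counter


def remove_question(problem: str) -> str:
    """Drop the final sentence, which is ASSUMED here to be the question."""
    sentences = re.split(r"(?<=[.?!])\s+", problem.strip())
    return " ".join(sentences[:-1])


def bag_of_words(problem: str) -> Counter:
    """Return lowercased token counts; word order is discarded."""
    return Counter(re.findall(r"[a-z0-9]+", problem.lower()))


# Hypothetical example problem, not taken from any benchmark.
problem = (
    "Jack had 8 pens. Mary gave him 5 more pens. "
    "How many pens does Jack have now?"
)

# Input a question-removed solver would see.
body_only = remove_question(problem)

# Two sentences with different meanings collapse to the same bag.
same_bag = bag_of_words("Jack gave Mary 5 pens.") == bag_of_words(
    "Mary gave Jack 5 pens."
)
```

If a solver still reaches high accuracy on inputs like `body_only`, or on representations where `same_bag` is true for problems with different answers, that suggests it is exploiting surface patterns rather than reading the question.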

Results

The same results are listed under four tasks: Question Answering, Math Word Problem Solving, Mathematical Question Answering, and Mathematical Reasoning.

| Dataset | Metric | Value | Model |
|---|---|---|---|
| MAWPS | Accuracy (%) | 88.7 | Graph2Tree with RoBERTa |
| MAWPS | Accuracy (%) | 88.5 | GTS with RoBERTa |
| ASDiv-A | Execution Accuracy | 82.2 | Graph2Tree with RoBERTa |
| ASDiv-A | Execution Accuracy | 81.2 | GTS with RoBERTa |
| ASDiv-A | Execution Accuracy | 76.9 | LSTM Seq2Seq with RoBERTa |
| SVAMP | Accuracy | 43.8 | Graph2Tree with RoBERTa |
| SVAMP | Execution Accuracy | 43.8 | Graph2Tree with RoBERTa |
| SVAMP | Accuracy | 41 | GTS with RoBERTa |
| SVAMP | Execution Accuracy | 41 | GTS with RoBERTa |
| SVAMP | Accuracy | 40.3 | LSTM Seq2Seq with RoBERTa |
| SVAMP | Execution Accuracy | 40.3 | LSTM Seq2Seq with RoBERTa |
| SVAMP | Accuracy | 38.9 | Transformer with RoBERTa |
| SVAMP | Execution Accuracy | 38.9 | Transformer with RoBERTa |
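
The "Execution Accuracy" metric in the results can be sketched as follows, assuming it means that a prediction counts as correct when the predicted arithmetic expression evaluates to the gold answer. That definition, the tolerance `tol`, the function names, and the example data are assumptions made for illustration; the scoring used for the reported numbers may differ.

```python
import ast
import operator

# Only the four basic arithmetic operators are supported in this sketch.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def evaluate(expr: str) -> float:
    """Evaluate an arithmetic expression without calling eval()."""

    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if (
            isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)
        ):
            return float(node.value)
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError(f"unsupported expression: {expr!r}")

    return walk(ast.parse(expr, mode="eval"))


def execution_accuracy(predictions, answers, tol=1e-4) -> float:
    """Percentage of predicted expressions whose value matches the gold answer."""
    if not answers:
        return 0.0
    correct = 0
    for expr, gold in zip(predictions, answers):
        try:
            value = evaluate(expr)
        except (ValueError, SyntaxError, ZeroDivisionError):
            continue  # malformed or invalid predictions count as wrong
        if abs(value - gold) <= tol:
            correct += 1
    return 100.0 * correct / len(answers)


# Hypothetical predictions and gold answers.
predictions = ["8 + 5", "(12 - 4) / 2", "7 *", "3 + 3"]
answers = [13, 4, 21, 7]
score = execution_accuracy(predictions, answers)
```

Scoring by the executed value rather than by string match means that equivalent expressions such as `5 + 8` and `8 + 5` are both accepted.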

Related Papers

- VAR-MATH: Probing True Mathematical Reasoning in Large Language Models via Symbolic Multi-Instance Benchmarks (2025-07-17)
- QuestA: Expanding Reasoning Capacity in LLMs via Question Augmentation (2025-07-17)
- Scaling Up RL: Unlocking Diverse Reasoning in LLMs via Prolonged Training (2025-07-16)
- Temperature and Persona Shape LLM Agent Consensus With Minimal Accuracy Gains in Qualitative Coding (2025-07-15)
- Personalized Exercise Recommendation with Semantically-Grounded Knowledge Tracing (2025-07-15)
- Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination (2025-07-14)
- A Practical Two-Stage Recipe for Mathematical LLMs: Maximizing Accuracy with SFT and Efficiency with Reinforcement Learning (2025-07-11)
- Skip a Layer or Loop it? Test-Time Depth Adaptation of Pretrained LLMs (2025-07-10)