Are NLP Models really able to Solve Simple Math Word Problems?

Arkil Patel, Satwik Bhattamishra, Navin Goyal

Published: 2021-03-12 | Venue: NAACL 2021 | Task: Math Word Problem Solving

Abstract

The problem of designing NLP solvers for math word problems (MWPs) has seen sustained research activity and steady gains in test accuracy. Since existing solvers achieve high performance on the benchmark datasets for elementary-level MWPs containing one-unknown arithmetic word problems, such problems are often considered "solved", with the bulk of research attention moving to more complex MWPs. In this paper, we restrict our attention to English MWPs taught in grades four and lower. We provide strong evidence that the existing MWP solvers rely on shallow heuristics to achieve high performance on the benchmark datasets. To this end, we show that MWP solvers that do not have access to the question asked in the MWP can still solve a large fraction of MWPs. Similarly, models that treat MWPs as bags of words can also achieve surprisingly high accuracy. Further, we introduce a challenge dataset, SVAMP, created by applying carefully chosen variations to examples sampled from existing datasets. The best accuracy achieved by state-of-the-art models is substantially lower on SVAMP, showing that much remains to be done even for the simplest MWPs.
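The two probes mentioned in the abstract (hiding the question from the solver, and treating the problem as a bag of words) can be illustrated with a minimal sketch. This is not the authors' code: the helper names `split_body_and_question` and `bag_of_words`, the assumption that the question is the final sentence, and the example problem are all hypothetical and chosen only for illustration.

```python
import re
from collections import Counter
from typing import Tuple


def split_body_and_question(problem: str) -> Tuple[str, str]:
    """Split an MWP into its narrative body and its question.

    Assumption (illustrative only): the question is the last sentence.
    """
    # Split on whitespace that follows sentence-ending punctuation.
    sentences = re.split(r"(?<=[.?!])\s+", problem.strip())
    return " ".join(sentences[:-1]), sentences[-1]


def bag_of_words(text: str) -> Counter:
    """Order-insensitive word counts, i.e. a bag-of-words view of the text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


# Hypothetical example problem, not taken from any dataset.
problem = (
    "Jack has 8 apples. He gives 3 apples to Jill. "
    "How many apples does Jack have now?"
)

# Question-removed input: a solver would see only `body`.
body, question = split_body_and_question(problem)

# Two sentences with different meanings but identical bags of words.
same_bag = bag_of_words("Jack gives 3 apples to Jill") == bag_of_words(
    "Jill gives 3 apples to Jack"
)
```

Here `body` keeps only the first two sentences, and `same_bag` is true because word order is discarded. A model that still scores well on such inputs is likely exploiting surface cues rather than the question being asked.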

Results

Task	Dataset	Metric	Value	Model
Question Answering	MAWPS	Accuracy (%)	88.7	Graph2Tree with RoBERTa
Question Answering	MAWPS	Accuracy (%)	88.5	GTS with RoBERTa
Question Answering	ASDiv-A	Execution Accuracy	82.2	Graph2Tree with RoBERTa
Question Answering	ASDiv-A	Execution Accuracy	81.2	GTS with RoBERTa
Question Answering	ASDiv-A	Execution Accuracy	76.9	LSTM Seq2Seq with RoBERTa
Question Answering	SVAMP	Accuracy	43.8	Graph2Tree with RoBERTa
Question Answering	SVAMP	Execution Accuracy	43.8	Graph2Tree with RoBERTa
Question Answering	SVAMP	Accuracy	41	GTS with RoBERTa
Question Answering	SVAMP	Execution Accuracy	41	GTS with RoBERTa
Question Answering	SVAMP	Accuracy	40.3	LSTM Seq2Seq with RoBERTa
Question Answering	SVAMP	Execution Accuracy	40.3	LSTM Seq2Seq with RoBERTa
Question Answering	SVAMP	Accuracy	38.9	Transformer with RoBERTa
Question Answering	SVAMP	Execution Accuracy	38.9	Transformer with RoBERTa
Math Word Problem Solving	MAWPS	Accuracy (%)	88.7	Graph2Tree with RoBERTa
Math Word Problem Solving	MAWPS	Accuracy (%)	88.5	GTS with RoBERTa
Math Word Problem Solving	ASDiv-A	Execution Accuracy	82.2	Graph2Tree with RoBERTa
Math Word Problem Solving	ASDiv-A	Execution Accuracy	81.2	GTS with RoBERTa
Math Word Problem Solving	ASDiv-A	Execution Accuracy	76.9	LSTM Seq2Seq with RoBERTa
Math Word Problem Solving	SVAMP	Accuracy	43.8	Graph2Tree with RoBERTa
Math Word Problem Solving	SVAMP	Execution Accuracy	43.8	Graph2Tree with RoBERTa
Math Word Problem Solving	SVAMP	Accuracy	41	GTS with RoBERTa
Math Word Problem Solving	SVAMP	Execution Accuracy	41	GTS with RoBERTa
Math Word Problem Solving	SVAMP	Accuracy	40.3	LSTM Seq2Seq with RoBERTa
Math Word Problem Solving	SVAMP	Execution Accuracy	40.3	LSTM Seq2Seq with RoBERTa
Math Word Problem Solving	SVAMP	Accuracy	38.9	Transformer with RoBERTa
Math Word Problem Solving	SVAMP	Execution Accuracy	38.9	Transformer with RoBERTa
Mathematical Question Answering	MAWPS	Accuracy (%)	88.7	Graph2Tree with RoBERTa
Mathematical Question Answering	MAWPS	Accuracy (%)	88.5	GTS with RoBERTa
Mathematical Question Answering	ASDiv-A	Execution Accuracy	82.2	Graph2Tree with RoBERTa
Mathematical Question Answering	ASDiv-A	Execution Accuracy	81.2	GTS with RoBERTa
Mathematical Question Answering	ASDiv-A	Execution Accuracy	76.9	LSTM Seq2Seq with RoBERTa
Mathematical Question Answering	SVAMP	Accuracy	43.8	Graph2Tree with RoBERTa
Mathematical Question Answering	SVAMP	Execution Accuracy	43.8	Graph2Tree with RoBERTa
Mathematical Question Answering	SVAMP	Accuracy	41	GTS with RoBERTa
Mathematical Question Answering	SVAMP	Execution Accuracy	41	GTS with RoBERTa
Mathematical Question Answering	SVAMP	Accuracy	40.3	LSTM Seq2Seq with RoBERTa
Mathematical Question Answering	SVAMP	Execution Accuracy	40.3	LSTM Seq2Seq with RoBERTa
Mathematical Question Answering	SVAMP	Accuracy	38.9	Transformer with RoBERTa
Mathematical Question Answering	SVAMP	Execution Accuracy	38.9	Transformer with RoBERTa
Mathematical Reasoning	MAWPS	Accuracy (%)	88.7	Graph2Tree with RoBERTa
Mathematical Reasoning	MAWPS	Accuracy (%)	88.5	GTS with RoBERTa
Mathematical Reasoning	ASDiv-A	Execution Accuracy	82.2	Graph2Tree with RoBERTa
Mathematical Reasoning	ASDiv-A	Execution Accuracy	81.2	GTS with RoBERTa
Mathematical Reasoning	ASDiv-A	Execution Accuracy	76.9	LSTM Seq2Seq with RoBERTa
Mathematical Reasoning	SVAMP	Accuracy	43.8	Graph2Tree with RoBERTa
Mathematical Reasoning	SVAMP	Execution Accuracy	43.8	Graph2Tree with RoBERTa
Mathematical Reasoning	SVAMP	Accuracy	41	GTS with RoBERTa
Mathematical Reasoning	SVAMP	Execution Accuracy	41	GTS with RoBERTa
Mathematical Reasoning	SVAMP	Accuracy	40.3	LSTM Seq2Seq with RoBERTa
Mathematical Reasoning	SVAMP	Execution Accuracy	40.3	LSTM Seq2Seq with RoBERTa
Mathematical Reasoning	SVAMP	Accuracy	38.9	Transformer with RoBERTa
Mathematical Reasoning	SVAMP	Execution Accuracy	38.9	Transformer with RoBERTa
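To make the gap in the table concrete, the short sketch below computes how far the best listed model (Graph2Tree with RoBERTa) falls on SVAMP relative to the other two datasets. The values are copied from the table; the dictionary and the function name `accuracy_drop` are illustrative assumptions. The datasets and metric labels differ, so the differences are indicative only.

```python
# Best listed scores for Graph2Tree with RoBERTa, copied from the table above.
best_scores = {
    "MAWPS": 88.7,
    "ASDiv-A": 82.2,
    "SVAMP": 43.8,
}


def accuracy_drop(source: str, target: str = "SVAMP") -> float:
    """Difference in percentage points between two datasets' best scores."""
    return round(best_scores[source] - best_scores[target], 1)


drop_from_mawps = accuracy_drop("MAWPS")    # 88.7 - 43.8
drop_from_asdiv = accuracy_drop("ASDiv-A")  # 82.2 - 43.8
```

Under these numbers the drop is roughly 45 points from MAWPS and roughly 38 points from ASDiv-A.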

