Arkil Patel, Satwik Bhattamishra, Navin Goyal
The problem of designing NLP solvers for math word problems (MWPs) has seen sustained research activity and steady gains in test accuracy. Since existing solvers achieve high performance on the benchmark datasets for elementary-level MWPs containing one-unknown arithmetic word problems, such problems are often considered "solved", with the bulk of research attention moving to more complex MWPs. In this paper, we restrict our attention to English MWPs taught in grades four and lower. We provide strong evidence that existing MWP solvers rely on shallow heuristics to achieve high performance on the benchmark datasets. To this end, we show that MWP solvers that do not have access to the question asked in the MWP can still solve a large fraction of MWPs. Similarly, models that treat MWPs as a bag of words can also achieve surprisingly high accuracy. Further, we introduce a challenge dataset, SVAMP, created by applying carefully chosen variations to examples sampled from existing datasets. The best accuracy achieved by state-of-the-art models is substantially lower on SVAMP, showing that much remains to be done even for the simplest MWPs.
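
The two probes described in the abstract (hiding the question from the solver, and treating the problem as a bag of words) can be illustrated with a minimal sketch. This is not the authors' code. The function names `split_body_and_question` and `bag_of_words` are hypothetical, and the sketch assumes that the question is the final sentence of the problem, which will not hold for every MWP.

```python
import re
from collections import Counter
from typing import Tuple


def split_body_and_question(problem: str) -> Tuple[str, str]:
    """Split a word problem into (body, question).

    Assumption for illustration only: the question is the last sentence.
    Sentences are split at whitespace that follows '.', '?' or '!', so
    decimals such as 3.5 stay intact, but abbreviations such as 'Mr.'
    would be split incorrectly.
    """
    sentences = [s for s in re.split(r"(?<=[.?!])\s+", problem.strip()) if s]
    if len(sentences) < 2:
        # Nothing to separate: treat the whole text as the body.
        return problem.strip(), ""
    return " ".join(sentences[:-1]), sentences[-1]


def bag_of_words(text: str) -> Counter:
    """Return lowercase token counts, discarding all word-order information."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


problem = "Jack had 8 pens. Mary had 5 pens. How many pens do they have in total?"
body, question = split_body_and_question(problem)
print(body)      # the input a question-removed solver would see
print(question)  # the part that is withheld
print(bag_of_words(body))
```

A solver trained only on `body` never sees what is being asked, so any accuracy it reaches must come from cues in the narrative alone. Likewise, two problems whose words are the same but ordered differently map to the same `bag_of_words` output, so a model built on that representation cannot tell them apart.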
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Question Answering | MAWPS | Accuracy (%) | 88.7 | Graph2Tree with RoBERTa |
| Question Answering | MAWPS | Accuracy (%) | 88.5 | GTS with RoBERTa |
| Question Answering | ASDiv-A | Execution Accuracy | 82.2 | Graph2Tree with RoBERTa |
| Question Answering | ASDiv-A | Execution Accuracy | 81.2 | GTS with RoBERTa |
| Question Answering | ASDiv-A | Execution Accuracy | 76.9 | LSTM Seq2Seq with RoBERTa |
| Question Answering | SVAMP | Accuracy | 43.8 | Graph2Tree with RoBERTa |
| Question Answering | SVAMP | Execution Accuracy | 43.8 | Graph2Tree with RoBERTa |
| Question Answering | SVAMP | Accuracy | 41.0 | GTS with RoBERTa |
| Question Answering | SVAMP | Execution Accuracy | 41.0 | GTS with RoBERTa |
| Question Answering | SVAMP | Accuracy | 40.3 | LSTM Seq2Seq with RoBERTa |
| Question Answering | SVAMP | Execution Accuracy | 40.3 | LSTM Seq2Seq with RoBERTa |
| Question Answering | SVAMP | Accuracy | 38.9 | Transformer with RoBERTa |
| Question Answering | SVAMP | Execution Accuracy | 38.9 | Transformer with RoBERTa |
| Math Word Problem Solving | MAWPS | Accuracy (%) | 88.7 | Graph2Tree with RoBERTa |
| Math Word Problem Solving | MAWPS | Accuracy (%) | 88.5 | GTS with RoBERTa |
| Math Word Problem Solving | ASDiv-A | Execution Accuracy | 82.2 | Graph2Tree with RoBERTa |
| Math Word Problem Solving | ASDiv-A | Execution Accuracy | 81.2 | GTS with RoBERTa |
| Math Word Problem Solving | ASDiv-A | Execution Accuracy | 76.9 | LSTM Seq2Seq with RoBERTa |
| Math Word Problem Solving | SVAMP | Accuracy | 43.8 | Graph2Tree with RoBERTa |
| Math Word Problem Solving | SVAMP | Execution Accuracy | 43.8 | Graph2Tree with RoBERTa |
| Math Word Problem Solving | SVAMP | Accuracy | 41.0 | GTS with RoBERTa |
| Math Word Problem Solving | SVAMP | Execution Accuracy | 41.0 | GTS with RoBERTa |
| Math Word Problem Solving | SVAMP | Accuracy | 40.3 | LSTM Seq2Seq with RoBERTa |
| Math Word Problem Solving | SVAMP | Execution Accuracy | 40.3 | LSTM Seq2Seq with RoBERTa |
| Math Word Problem Solving | SVAMP | Accuracy | 38.9 | Transformer with RoBERTa |
| Math Word Problem Solving | SVAMP | Execution Accuracy | 38.9 | Transformer with RoBERTa |
| Mathematical Question Answering | MAWPS | Accuracy (%) | 88.7 | Graph2Tree with RoBERTa |
| Mathematical Question Answering | MAWPS | Accuracy (%) | 88.5 | GTS with RoBERTa |
| Mathematical Question Answering | ASDiv-A | Execution Accuracy | 82.2 | Graph2Tree with RoBERTa |
| Mathematical Question Answering | ASDiv-A | Execution Accuracy | 81.2 | GTS with RoBERTa |
| Mathematical Question Answering | ASDiv-A | Execution Accuracy | 76.9 | LSTM Seq2Seq with RoBERTa |
| Mathematical Question Answering | SVAMP | Accuracy | 43.8 | Graph2Tree with RoBERTa |
| Mathematical Question Answering | SVAMP | Execution Accuracy | 43.8 | Graph2Tree with RoBERTa |
| Mathematical Question Answering | SVAMP | Accuracy | 41.0 | GTS with RoBERTa |
| Mathematical Question Answering | SVAMP | Execution Accuracy | 41.0 | GTS with RoBERTa |
| Mathematical Question Answering | SVAMP | Accuracy | 40.3 | LSTM Seq2Seq with RoBERTa |
| Mathematical Question Answering | SVAMP | Execution Accuracy | 40.3 | LSTM Seq2Seq with RoBERTa |
| Mathematical Question Answering | SVAMP | Accuracy | 38.9 | Transformer with RoBERTa |
| Mathematical Question Answering | SVAMP | Execution Accuracy | 38.9 | Transformer with RoBERTa |
| Mathematical Reasoning | MAWPS | Accuracy (%) | 88.7 | Graph2Tree with RoBERTa |
| Mathematical Reasoning | MAWPS | Accuracy (%) | 88.5 | GTS with RoBERTa |
| Mathematical Reasoning | ASDiv-A | Execution Accuracy | 82.2 | Graph2Tree with RoBERTa |
| Mathematical Reasoning | ASDiv-A | Execution Accuracy | 81.2 | GTS with RoBERTa |
| Mathematical Reasoning | ASDiv-A | Execution Accuracy | 76.9 | LSTM Seq2Seq with RoBERTa |
| Mathematical Reasoning | SVAMP | Accuracy | 43.8 | Graph2Tree with RoBERTa |
| Mathematical Reasoning | SVAMP | Execution Accuracy | 43.8 | Graph2Tree with RoBERTa |
| Mathematical Reasoning | SVAMP | Accuracy | 41.0 | GTS with RoBERTa |
| Mathematical Reasoning | SVAMP | Execution Accuracy | 41.0 | GTS with RoBERTa |
| Mathematical Reasoning | SVAMP | Accuracy | 40.3 | LSTM Seq2Seq with RoBERTa |
| Mathematical Reasoning | SVAMP | Execution Accuracy | 40.3 | LSTM Seq2Seq with RoBERTa |
| Mathematical Reasoning | SVAMP | Accuracy | 38.9 | Transformer with RoBERTa |
| Mathematical Reasoning | SVAMP | Execution Accuracy | 38.9 | Transformer with RoBERTa |
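
The table does not define "Execution Accuracy". The sketch below assumes the usual reading for equation-generating solvers: a prediction counts as correct when the generated arithmetic expression evaluates to the gold answer, regardless of how the expression is written. The helper names `evaluate` and `execution_accuracy` and the tolerance `tol` are hypothetical choices for illustration, not taken from the paper.

```python
import ast
import operator
from typing import Sequence

# Only the four basic arithmetic operators are supported in this sketch.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def evaluate(expression: str) -> float:
    """Evaluate an arithmetic expression without calling eval()."""

    def _eval(node: ast.AST) -> float:
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return float(node.value)
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -_eval(node.operand)
        raise ValueError("unsupported expression")

    return _eval(ast.parse(expression, mode="eval"))


def execution_accuracy(predictions: Sequence[str],
                       answers: Sequence[float],
                       tol: float = 1e-4) -> float:
    """Percentage of predicted expressions whose value matches the gold answer.

    Expressions that fail to parse or evaluate are counted as wrong.
    """
    if not answers or len(predictions) != len(answers):
        raise ValueError("predictions and answers must be non-empty and aligned")
    correct = 0
    for expression, answer in zip(predictions, answers):
        try:
            value = evaluate(expression)
        except (ValueError, SyntaxError, ZeroDivisionError):
            continue
        if abs(value - answer) <= tol:
            correct += 1
    return 100.0 * correct / len(answers)


# One correct prediction, one wrong value, one malformed expression.
print(execution_accuracy(["5 + 8", "8 - 5", "8 * (5"], [13, 4, 40]))
```

Under this reading, `5 + 8` and `8 + 5` are both credited for a gold answer of 13, which is what separates execution accuracy from exact string match on the equation.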