Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Lila: A Unified Benchmark for Mathematical Reasoning

Swaroop Mishra, Matthew Finlayson, Pan Lu, Leonard Tang, Sean Welleck, Chitta Baral, Tanmay Rajpurohit, Oyvind Tafjord, Ashish Sabharwal, Peter Clark, Ashwin Kalyan

Published: 2022-10-31 · Tasks: Question Answering, Mathematical Reasoning
Links: Paper · PDF · Code (official)

Abstract

Mathematical reasoning skills are essential for general-purpose intelligent systems to perform tasks from grocery shopping to climate modeling. Towards evaluating and improving AI systems in this domain, we propose LILA, a unified mathematical reasoning benchmark consisting of 23 diverse tasks along four dimensions: (i) mathematical abilities, e.g., arithmetic, calculus; (ii) language format, e.g., question answering, fill-in-the-blank; (iii) language diversity, e.g., no language, simple language; (iv) external knowledge, e.g., commonsense, physics. We construct our benchmark by extending 20 existing datasets, collecting task instructions and solutions in the form of Python programs, thereby obtaining explainable solutions in addition to the correct answer. We additionally introduce two evaluation datasets to measure out-of-distribution performance and robustness to language perturbation. Finally, we introduce BHASKARA, a general-purpose mathematical reasoning model trained on LILA. Importantly, we find that multi-tasking leads to significant improvements (an average relative improvement of 21.83% in F1 score over single-task models), while the best-performing model obtains only 60.40%, indicating substantial room for improvement in general mathematical reasoning and understanding.
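The abstract notes that each solution is collected as a Python program, so the reasoning is auditable by executing it. Below is a minimal sketch of what such an instance might look like; the field names and the example problem are assumptions for illustration, not an actual LILA record:

```python
# Hypothetical LILA-style instance (field names are illustrative, not the
# benchmark's actual schema): the answer is produced by running a short
# Python program, making the solution explainable rather than opaque.
instance = {
    "instruction": "Solve the arithmetic word problem.",
    "question": "A store sells apples at 3 dollars each. "
                "How much do 7 apples cost?",
    # The program encodes the reasoning steps explicitly.
    "program": (
        "price_per_apple = 3\n"
        "n_apples = 7\n"
        "answer = price_per_apple * n_apples"
    ),
}

# Executing the program recovers the answer and exposes each step.
namespace = {}
exec(instance["program"], namespace)
print(namespace["answer"])  # 21
```

Evaluating a model against such an instance can then check either the final answer or the generated program itself, which is what enables the program-output ("-P") and answer ("-A") model variants reported in the results below.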

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Mathematical Reasoning | Lila (OOD) | Accuracy | 0.586 | Codex (Few-Shot, 175B) |
| Mathematical Reasoning | Lila (OOD) | Accuracy | 0.448 | Bhāskara-P (Fine-tuned, 2.7B) |
| Mathematical Reasoning | Lila (OOD) | Accuracy | 0.384 | GPT-3 (Few-Shot, 175B) |
| Mathematical Reasoning | Lila (OOD) | Accuracy | 0.268 | Bhāskara-A (Fine-tuned, 2.7B) |
| Mathematical Reasoning | Lila (OOD) | Accuracy | 0.238 | Neo-P (Fine-tuned, 2.7B) |
| Mathematical Reasoning | Lila (OOD) | Accuracy | 0.177 | Neo-A (Fine-tuned, 2.7B) |
| Mathematical Reasoning | Lila (IID) | Accuracy | 0.604 | Codex (Few-Shot, 175B) |
| Mathematical Reasoning | Lila (IID) | Accuracy | 0.480 | Bhāskara-P (Fine-tuned, 2.7B) |
| Mathematical Reasoning | Lila (IID) | Accuracy | 0.394 | Neo-P (Fine-tuned, 2.7B) |
| Mathematical Reasoning | Lila (IID) | Accuracy | 0.384 | GPT-3 (Few-Shot, 175B) |
| Mathematical Reasoning | Lila (IID) | Accuracy | 0.252 | Bhāskara-A (Fine-tuned, 2.7B) |
| Mathematical Reasoning | Lila (IID) | Accuracy | 0.204 | Neo-A (Fine-tuned, 2.7B) |

Related Papers

From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
VAR-MATH: Probing True Mathematical Reasoning in Large Language Models via Symbolic Multi-Instance Benchmarks (2025-07-17)
Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)
Is This Just Fantasy? Language Model Representations Reflect Human Judgments of Event Plausibility (2025-07-16)
A Survey of Deep Learning for Geometry Problem Solving (2025-07-16)