Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Lila: A Unified Benchmark for Mathematical Reasoning

Swaroop Mishra, Matthew Finlayson, Pan Lu, Leonard Tang, Sean Welleck, Chitta Baral, Tanmay Rajpurohit, Oyvind Tafjord, Ashish Sabharwal, Peter Clark, Ashwin Kalyan

Published: 2022-10-31 · Tasks: Question Answering, Mathematical Reasoning
Links: Paper · PDF · Code (official)

Abstract

Mathematical reasoning skills are essential for general-purpose intelligent systems to perform tasks from grocery shopping to climate modeling. Towards evaluating and improving AI systems in this domain, we propose LILA, a unified mathematical reasoning benchmark consisting of 23 diverse tasks along four dimensions: (i) mathematical abilities, e.g., arithmetic, calculus; (ii) language format, e.g., question answering, fill-in-the-blank; (iii) language diversity, e.g., no language, simple language; (iv) external knowledge, e.g., commonsense, physics. We construct our benchmark by extending 20 existing datasets, collecting task instructions and solutions in the form of Python programs, thereby obtaining explainable solutions in addition to the correct answer. We additionally introduce two evaluation datasets to measure out-of-distribution performance and robustness to language perturbation. Finally, we introduce BHASKARA, a general-purpose mathematical reasoning model trained on LILA. Importantly, we find that multi-tasking leads to significant improvements (an average relative improvement of 21.83% in F1 score over single-task models), while the best-performing model obtains only 60.40%, indicating substantial room for improvement in general mathematical reasoning and understanding.
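The abstract notes that each solution is collected as a Python program, so the reasoning is auditable by executing it. Below is a minimal sketch of what such an instance might look like; the field names and the example problem are assumptions for illustration, not an actual LILA record:

```python
# Hypothetical LILA-style instance (field names are illustrative, not the
# benchmark's actual schema): the answer is produced by running a short
# Python program, making the solution explainable rather than opaque.
instance = {
    "instruction": "Solve the arithmetic word problem.",
    "question": "A store sells apples at 3 dollars each. "
                "How much do 7 apples cost?",
    # The program encodes the reasoning steps explicitly.
    "program": (
        "price_per_apple = 3\n"
        "n_apples = 7\n"
        "answer = price_per_apple * n_apples"
    ),
}

# Executing the program recovers the answer and exposes each step.
namespace = {}
exec(instance["program"], namespace)
print(namespace["answer"])  # 21
```

Evaluating a model against such an instance can then check either the final answer or the generated program itself, which is what enables the program-output ("-P") and answer ("-A") model variants reported in the results below.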

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Mathematical Reasoning | Lila (OOD) | Accuracy | 0.586 | Codex (Few-Shot, 175B) |
| Mathematical Reasoning | Lila (OOD) | Accuracy | 0.448 | Bhāskara-P (Fine-tuned, 2.7B) |
| Mathematical Reasoning | Lila (OOD) | Accuracy | 0.384 | GPT-3 (Few-Shot, 175B) |
| Mathematical Reasoning | Lila (OOD) | Accuracy | 0.268 | Bhāskara-A (Fine-tuned, 2.7B) |
| Mathematical Reasoning | Lila (OOD) | Accuracy | 0.238 | Neo-P (Fine-tuned, 2.7B) |
| Mathematical Reasoning | Lila (OOD) | Accuracy | 0.177 | Neo-A (Fine-tuned, 2.7B) |
| Mathematical Reasoning | Lila (IID) | Accuracy | 0.604 | Codex (Few-Shot, 175B) |
| Mathematical Reasoning | Lila (IID) | Accuracy | 0.480 | Bhāskara-P (Fine-tuned, 2.7B) |
| Mathematical Reasoning | Lila (IID) | Accuracy | 0.394 | Neo-P (Fine-tuned, 2.7B) |
| Mathematical Reasoning | Lila (IID) | Accuracy | 0.384 | GPT-3 (Few-Shot, 175B) |
| Mathematical Reasoning | Lila (IID) | Accuracy | 0.252 | Bhāskara-A (Fine-tuned, 2.7B) |
| Mathematical Reasoning | Lila (IID) | Accuracy | 0.204 | Neo-A (Fine-tuned, 2.7B) |

Related Papers

From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
VAR-MATH: Probing True Mathematical Reasoning in Large Language Models via Symbolic Multi-Instance Benchmarks (2025-07-17)
Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)
Is This Just Fantasy? Language Model Representations Reflect Human Judgments of Event Plausibility (2025-07-16)
A Survey of Deep Learning for Geometry Problem Solving (2025-07-16)