Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data

Shubham Toshniwal, Wei Du, Ivan Moshkov, Branislav Kisacanin, Alexan Ayrapetyan, Igor Gitman

Published: 2024-10-02
Tags: Mathematical Reasoning · Math · Math Word Problem Solving · Large Language Model · Arithmetic Reasoning
Links: Paper · PDF · Code

Abstract

Mathematical reasoning continues to be a critical challenge in large language model (LLM) development, attracting significant interest. However, most cutting-edge progress in mathematical reasoning with LLMs has become closed-source due to a lack of access to training data. This lack of data access limits researchers from understanding the impact of different choices for synthesizing and utilizing the data. With the goal of creating a high-quality supervised finetuning (SFT) dataset for math reasoning, we conduct careful ablation experiments on data synthesis using the recently released Llama3.1 family of models. Our experiments show that: (a) solution format matters, with excessively verbose solutions proving detrimental to SFT performance; (b) data generated by a strong teacher outperforms equally sized data generated by a weak student model; (c) SFT is robust to low-quality solutions, allowing for imprecise data filtering; and (d) question diversity is crucial for achieving data scaling gains. Based on these insights, we create the OpenMathInstruct-2 dataset, which consists of 14M question-solution pairs (≈600K unique questions), making it nearly eight times larger than the previous largest open-source math reasoning dataset. Finetuning Llama-3.1-8B-Base on OpenMathInstruct-2 outperforms Llama3.1-8B-Instruct on MATH by an absolute 15.9% (51.9% → 67.8%). Finally, to accelerate open-source efforts, we release the code, the finetuned models, and the OpenMathInstruct-2 dataset under a commercially permissive license.
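Finding (c) above says SFT tolerates imprecise data filtering of synthesized solutions. A common coarse filter in math SFT pipelines keeps only solutions whose extracted final answer matches the reference answer. The sketch below illustrates that idea under simplifying assumptions (the function names are hypothetical, and the regex assumes a flat `\boxed{...}` convention without nested braces, which the actual pipeline may handle differently):

```python
import re


def extract_final_answer(solution: str):
    """Pull the last \\boxed{...} expression from a generated solution.

    Simplified: does not handle nested braces inside the box.
    Returns None when no boxed answer is found.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", solution)
    return matches[-1].strip() if matches else None


def filter_by_answer(samples, reference_answer: str):
    """Keep only synthesized solutions whose final answer matches the
    reference -- a deliberately coarse filter, in the spirit of the
    paper's finding that SFT is robust to imperfect filtering."""
    return [s for s in samples if extract_final_answer(s) == reference_answer]


# Toy synthesized solutions for a problem whose true answer is 7.
samples = [
    "We compute 3 + 4 = 7, so the answer is \\boxed{7}.",
    "Careless arithmetic gives \\boxed{8}.",
    "No boxed answer here.",
]
kept = filter_by_answer(samples, "7")  # only the first solution survives
```

Note that this filter only checks the final answer, not the reasoning steps, so flawed-but-lucky solutions can slip through; per the paper's ablation, that imprecision is acceptable for SFT.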

Results

Task | Dataset | Metric | Value | Model
Question Answering | MATH | Accuracy | 79.6 | OpenMath2-Llama3.1-70B (majority@256)
Question Answering | MATH | Accuracy | 76.1 | OpenMath2-Llama3.1-8B (majority@256)
Question Answering | MATH | Accuracy | 71.9 | OpenMath2-Llama3.1-70B
Question Answering | MATH | Accuracy | 67.8 | OpenMath2-Llama3.1-8B
Math Word Problem Solving | MATH | Accuracy | 79.6 | OpenMath2-Llama3.1-70B (majority@256)
Math Word Problem Solving | MATH | Accuracy | 76.1 | OpenMath2-Llama3.1-8B (majority@256)
Math Word Problem Solving | MATH | Accuracy | 71.9 | OpenMath2-Llama3.1-70B
Math Word Problem Solving | MATH | Accuracy | 67.8 | OpenMath2-Llama3.1-8B
Mathematical Question Answering | MATH | Accuracy | 79.6 | OpenMath2-Llama3.1-70B (majority@256)
Mathematical Question Answering | MATH | Accuracy | 76.1 | OpenMath2-Llama3.1-8B (majority@256)
Mathematical Question Answering | MATH | Accuracy | 71.9 | OpenMath2-Llama3.1-70B
Mathematical Question Answering | MATH | Accuracy | 67.8 | OpenMath2-Llama3.1-8B
Mathematical Reasoning | MATH | Accuracy | 79.6 | OpenMath2-Llama3.1-70B (majority@256)
Mathematical Reasoning | MATH | Accuracy | 76.1 | OpenMath2-Llama3.1-8B (majority@256)
Mathematical Reasoning | MATH | Accuracy | 71.9 | OpenMath2-Llama3.1-70B
Mathematical Reasoning | MATH | Accuracy | 67.8 | OpenMath2-Llama3.1-8B
Arithmetic Reasoning | GSM8K | Accuracy | 96.0 | OpenMath2-Llama3.1-70B (majority@256)
Arithmetic Reasoning | GSM8K | Accuracy | 94.9 | OpenMath2-Llama3.1-70B
Arithmetic Reasoning | GSM8K | Accuracy | 94.1 | OpenMath2-Llama3.1-8B (majority@256)
Arithmetic Reasoning | GSM8K | Accuracy | 91.7 | OpenMath2-Llama3.1-8B
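Several rows above report "majority@256" scores. This refers to majority voting (self-consistency): the model samples many solutions per problem, and the most frequent final answer is taken as the prediction. A minimal sketch of the aggregation step, assuming answers have already been extracted from the 256 sampled solutions (the function name is hypothetical):

```python
from collections import Counter


def majority_at_k(sampled_answers):
    """Majority voting over k sampled final answers (self-consistency).

    Answers that could not be extracted are passed as None and are
    ignored. Ties break toward the earlier-seen answer, which is one
    possible convention; the paper's exact tie-breaking may differ.
    """
    counts = Counter(a for a in sampled_answers if a is not None)
    if not counts:
        return None
    answer, _ = counts.most_common(1)[0]
    return answer


# Toy example with k = 5 sampled answers for one problem.
votes = ["12", "12", "15", "12", None]
prediction = majority_at_k(votes)  # the most frequent answer, "12"
```

Majority voting trades inference cost (k forward passes per problem) for accuracy, which is why the majority@256 rows consistently score several points above the corresponding greedy single-sample rows.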

Related Papers

DENSE: Longitudinal Progress Note Generation with Temporal Modeling of Heterogeneous Clinical Notes Across Hospital Visits (2025-07-18)
VAR-MATH: Probing True Mathematical Reasoning in Large Language Models via Symbolic Multi-Instance Benchmarks (2025-07-17)
QuestA: Expanding Reasoning Capacity in LLMs via Question Augmentation (2025-07-17)
GeoReg: Weight-Constrained Few-Shot Regression for Socio-Economic Estimation using LLM (2025-07-17)
The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations (2025-07-17)
Inverse Reinforcement Learning Meets Large Language Model Post-Training: Basics, Advances, and Opportunities (2025-07-17)
Rethinking the Embodied Gap in Vision-and-Language Navigation: A Holistic Study of Physical and Visual Disparities (2025-07-17)
A Survey of Deep Learning for Geometry Problem Solving (2025-07-16)