Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data

Shubham Toshniwal, Wei Du, Ivan Moshkov, Branislav Kisacanin, Alexan Ayrapetyan, Igor Gitman

Published: 2024-10-02
Tags: Mathematical Reasoning · Math · Math Word Problem Solving · Large Language Model · Arithmetic Reasoning
Links: Paper · PDF · Code

Abstract

Mathematical reasoning continues to be a critical challenge in large language model (LLM) development, attracting significant interest. However, most cutting-edge progress in mathematical reasoning with LLMs has become closed-source due to a lack of access to training data. This lack of data access limits researchers from understanding the impact of different choices for synthesizing and utilizing the data. With the goal of creating a high-quality supervised finetuning (SFT) dataset for math reasoning, we conduct careful ablation experiments on data synthesis using the recently released Llama3.1 family of models. Our experiments show that: (a) solution format matters, with excessively verbose solutions proving detrimental to SFT performance; (b) data generated by a strong teacher outperforms equally sized data generated by a weak student model; (c) SFT is robust to low-quality solutions, allowing for imprecise data filtering; and (d) question diversity is crucial for achieving data scaling gains. Based on these insights, we create the OpenMathInstruct-2 dataset, which consists of 14M question-solution pairs (≈600K unique questions), making it nearly eight times larger than the previous largest open-source math reasoning dataset. Finetuning Llama-3.1-8B-Base on OpenMathInstruct-2 outperforms Llama3.1-8B-Instruct on MATH by an absolute 15.9% (51.9% → 67.8%). Finally, to accelerate open-source efforts, we release the code, the finetuned models, and the OpenMathInstruct-2 dataset under a commercially permissive license.
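Finding (c) above says SFT tolerates imprecise data filtering of synthesized solutions. A common coarse filter in math SFT pipelines keeps only solutions whose extracted final answer matches the reference answer. The sketch below illustrates that idea under simplifying assumptions (the function names are hypothetical, and the regex assumes a flat `\boxed{...}` convention without nested braces, which the actual pipeline may handle differently):

```python
import re


def extract_final_answer(solution: str):
    """Pull the last \\boxed{...} expression from a generated solution.

    Simplified: does not handle nested braces inside the box.
    Returns None when no boxed answer is found.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", solution)
    return matches[-1].strip() if matches else None


def filter_by_answer(samples, reference_answer: str):
    """Keep only synthesized solutions whose final answer matches the
    reference -- a deliberately coarse filter, in the spirit of the
    paper's finding that SFT is robust to imperfect filtering."""
    return [s for s in samples if extract_final_answer(s) == reference_answer]


# Toy synthesized solutions for a problem whose true answer is 7.
samples = [
    "We compute 3 + 4 = 7, so the answer is \\boxed{7}.",
    "Careless arithmetic gives \\boxed{8}.",
    "No boxed answer here.",
]
kept = filter_by_answer(samples, "7")  # only the first solution survives
```

Note that this filter only checks the final answer, not the reasoning steps, so flawed-but-lucky solutions can slip through; per the paper's ablation, that imprecision is acceptable for SFT.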

Results

Task | Dataset | Metric | Value | Model
Question Answering | MATH | Accuracy | 79.6 | OpenMath2-Llama3.1-70B (majority@256)
Question Answering | MATH | Accuracy | 76.1 | OpenMath2-Llama3.1-8B (majority@256)
Question Answering | MATH | Accuracy | 71.9 | OpenMath2-Llama3.1-70B
Question Answering | MATH | Accuracy | 67.8 | OpenMath2-Llama3.1-8B
Math Word Problem Solving | MATH | Accuracy | 79.6 | OpenMath2-Llama3.1-70B (majority@256)
Math Word Problem Solving | MATH | Accuracy | 76.1 | OpenMath2-Llama3.1-8B (majority@256)
Math Word Problem Solving | MATH | Accuracy | 71.9 | OpenMath2-Llama3.1-70B
Math Word Problem Solving | MATH | Accuracy | 67.8 | OpenMath2-Llama3.1-8B
Mathematical Question Answering | MATH | Accuracy | 79.6 | OpenMath2-Llama3.1-70B (majority@256)
Mathematical Question Answering | MATH | Accuracy | 76.1 | OpenMath2-Llama3.1-8B (majority@256)
Mathematical Question Answering | MATH | Accuracy | 71.9 | OpenMath2-Llama3.1-70B
Mathematical Question Answering | MATH | Accuracy | 67.8 | OpenMath2-Llama3.1-8B
Mathematical Reasoning | MATH | Accuracy | 79.6 | OpenMath2-Llama3.1-70B (majority@256)
Mathematical Reasoning | MATH | Accuracy | 76.1 | OpenMath2-Llama3.1-8B (majority@256)
Mathematical Reasoning | MATH | Accuracy | 71.9 | OpenMath2-Llama3.1-70B
Mathematical Reasoning | MATH | Accuracy | 67.8 | OpenMath2-Llama3.1-8B
Arithmetic Reasoning | GSM8K | Accuracy | 96.0 | OpenMath2-Llama3.1-70B (majority@256)
Arithmetic Reasoning | GSM8K | Accuracy | 94.9 | OpenMath2-Llama3.1-70B
Arithmetic Reasoning | GSM8K | Accuracy | 94.1 | OpenMath2-Llama3.1-8B (majority@256)
Arithmetic Reasoning | GSM8K | Accuracy | 91.7 | OpenMath2-Llama3.1-8B
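Several rows above report "majority@256" scores. This refers to majority voting (self-consistency): the model samples many solutions per problem, and the most frequent final answer is taken as the prediction. A minimal sketch of the aggregation step, assuming answers have already been extracted from the 256 sampled solutions (the function name is hypothetical):

```python
from collections import Counter


def majority_at_k(sampled_answers):
    """Majority voting over k sampled final answers (self-consistency).

    Answers that could not be extracted are passed as None and are
    ignored. Ties break toward the earlier-seen answer, which is one
    possible convention; the paper's exact tie-breaking may differ.
    """
    counts = Counter(a for a in sampled_answers if a is not None)
    if not counts:
        return None
    answer, _ = counts.most_common(1)[0]
    return answer


# Toy example with k = 5 sampled answers for one problem.
votes = ["12", "12", "15", "12", None]
prediction = majority_at_k(votes)  # the most frequent answer, "12"
```

Majority voting trades inference cost (k forward passes per problem) for accuracy, which is why the majority@256 rows consistently score several points above the corresponding greedy single-sample rows.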

Related Papers

DENSE: Longitudinal Progress Note Generation with Temporal Modeling of Heterogeneous Clinical Notes Across Hospital Visits (2025-07-18)
VAR-MATH: Probing True Mathematical Reasoning in Large Language Models via Symbolic Multi-Instance Benchmarks (2025-07-17)
QuestA: Expanding Reasoning Capacity in LLMs via Question Augmentation (2025-07-17)
GeoReg: Weight-Constrained Few-Shot Regression for Socio-Economic Estimation using LLM (2025-07-17)
The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations (2025-07-17)
Inverse Reinforcement Learning Meets Large Language Model Post-Training: Basics, Advances, and Opportunities (2025-07-17)
Rethinking the Embodied Gap in Vision-and-Language Navigation: A Holistic Study of Physical and Visual Disparities (2025-07-17)
A Survey of Deep Learning for Geometry Problem Solving (2025-07-16)