An Empirical Study of Data Ability Boundary in LLMs' Math Reasoning

Zui Chen, Yezeng Chen, Jiaqi Han, Zhijie Huang, Ji Qi, Yi Zhou

2024-02-23 | Math Word Problem Solving, Automated Theorem Proving, Arithmetic Reasoning

Abstract

Large language models (LLMs) are displaying emergent abilities on math reasoning tasks, and there is growing attention on enhancing the ability of open-source LLMs through supervised fine-tuning (SFT). In this paper, we aim to explore a general data strategy for supervised data to help optimize and expand math reasoning ability. First, we determine the ability boundary of reasoning path augmentation by identifying the minimal optimal set of these paths. Second, we validate that different abilities of the model can be cumulatively enhanced by a Mix of Minimal Optimal Sets (MMOS) of the corresponding types of data, and our MMOS models achieve SOTA performance on a series of base models at much lower construction cost. We also point out that GSM-HARD is not really hard and that today's LLMs no longer lack numerical robustness. In addition, we provide an Auto Problem Generator for robustness testing and educational applications. Our code and data are publicly available at https://github.com/cyzhh/MMOS.

Results

Task	Dataset	Metric	Value	Model
Question Answering	MATH	Accuracy	63.7	MMOS-DeepSeekMath-7B(0-shot,k=50)
Question Answering	MATH	Parameters (Billions)	7	MMOS-DeepSeekMath-7B(0-shot,k=50)
Question Answering	MATH	Accuracy	55	MMOS-DeepSeekMath-7B(0-shot)
Question Answering	MATH	Parameters (Billions)	7	MMOS-DeepSeekMath-7B(0-shot)
Question Answering	MATH	Accuracy	49.5	MMOS-CODE-34B(0-shot)
Question Answering	MATH	Parameters (Billions)	34	MMOS-CODE-34B(0-shot)
Question Answering	MATH	Accuracy	44.3	MMOS-CODE-7B(0-shot)
Question Answering	MATH	Parameters (Billions)	7	MMOS-CODE-7B(0-shot)
Question Answering	ASDiv-A	Execution Accuracy	87.6	MMOS-DeepSeekMath-7B(0-shot)
Question Answering	ASDiv-A	Execution Accuracy	85.1	MMOS-CODE-34B(0-shot)
Question Answering	ASDiv-A	Execution Accuracy	78.6	MMOS-CODE-7B(0-shot)
Question Answering	SVAMP	Execution Accuracy	80.6	MMOS-CODE-34B(0-shot)
Question Answering	SVAMP	Execution Accuracy	79.3	MMOS-DeepSeekMath-7B(0-shot)
Question Answering	SVAMP	Execution Accuracy	76.4	MMOS-CODE-7B(0-shot)
Automated Theorem Proving	miniF2F-test	Pass@1	28.3	MMOS-DeepSeekMath-7B
Automated Theorem Proving	miniF2F-test	cumulative	28.3	MMOS-DeepSeekMath-7B
Mathematical Proofs	miniF2F-test	Pass@1	28.3	MMOS-DeepSeekMath-7B
Mathematical Proofs	miniF2F-test	cumulative	28.3	MMOS-DeepSeekMath-7B
Math Word Problem Solving	MATH	Accuracy	63.7	MMOS-DeepSeekMath-7B(0-shot,k=50)
Math Word Problem Solving	MATH	Parameters (Billions)	7	MMOS-DeepSeekMath-7B(0-shot,k=50)
Math Word Problem Solving	MATH	Accuracy	55	MMOS-DeepSeekMath-7B(0-shot)
Math Word Problem Solving	MATH	Parameters (Billions)	7	MMOS-DeepSeekMath-7B(0-shot)
Math Word Problem Solving	MATH	Accuracy	49.5	MMOS-CODE-34B(0-shot)
Math Word Problem Solving	MATH	Parameters (Billions)	34	MMOS-CODE-34B(0-shot)
Math Word Problem Solving	MATH	Accuracy	44.3	MMOS-CODE-7B(0-shot)
Math Word Problem Solving	MATH	Parameters (Billions)	7	MMOS-CODE-7B(0-shot)
Math Word Problem Solving	ASDiv-A	Execution Accuracy	87.6	MMOS-DeepSeekMath-7B(0-shot)
Math Word Problem Solving	ASDiv-A	Execution Accuracy	85.1	MMOS-CODE-34B(0-shot)
Math Word Problem Solving	ASDiv-A	Execution Accuracy	78.6	MMOS-CODE-7B(0-shot)
Math Word Problem Solving	SVAMP	Execution Accuracy	80.6	MMOS-CODE-34B(0-shot)
Math Word Problem Solving	SVAMP	Execution Accuracy	79.3	MMOS-DeepSeekMath-7B(0-shot)
Math Word Problem Solving	SVAMP	Execution Accuracy	76.4	MMOS-CODE-7B(0-shot)
Mathematical Question Answering	MATH	Accuracy	63.7	MMOS-DeepSeekMath-7B(0-shot,k=50)
Mathematical Question Answering	MATH	Parameters (Billions)	7	MMOS-DeepSeekMath-7B(0-shot,k=50)
Mathematical Question Answering	MATH	Accuracy	55	MMOS-DeepSeekMath-7B(0-shot)
Mathematical Question Answering	MATH	Parameters (Billions)	7	MMOS-DeepSeekMath-7B(0-shot)
Mathematical Question Answering	MATH	Accuracy	49.5	MMOS-CODE-34B(0-shot)
Mathematical Question Answering	MATH	Parameters (Billions)	34	MMOS-CODE-34B(0-shot)
Mathematical Question Answering	MATH	Accuracy	44.3	MMOS-CODE-7B(0-shot)
Mathematical Question Answering	MATH	Parameters (Billions)	7	MMOS-CODE-7B(0-shot)
Mathematical Question Answering	ASDiv-A	Execution Accuracy	87.6	MMOS-DeepSeekMath-7B(0-shot)
Mathematical Question Answering	ASDiv-A	Execution Accuracy	85.1	MMOS-CODE-34B(0-shot)
Mathematical Question Answering	ASDiv-A	Execution Accuracy	78.6	MMOS-CODE-7B(0-shot)
Mathematical Question Answering	SVAMP	Execution Accuracy	80.6	MMOS-CODE-34B(0-shot)
Mathematical Question Answering	SVAMP	Execution Accuracy	79.3	MMOS-DeepSeekMath-7B(0-shot)
Mathematical Question Answering	SVAMP	Execution Accuracy	76.4	MMOS-CODE-7B(0-shot)
Mathematical Reasoning	MATH	Accuracy	63.7	MMOS-DeepSeekMath-7B(0-shot,k=50)
Mathematical Reasoning	MATH	Parameters (Billions)	7	MMOS-DeepSeekMath-7B(0-shot,k=50)
Mathematical Reasoning	MATH	Accuracy	55	MMOS-DeepSeekMath-7B(0-shot)
Mathematical Reasoning	MATH	Parameters (Billions)	7	MMOS-DeepSeekMath-7B(0-shot)
Mathematical Reasoning	MATH	Accuracy	49.5	MMOS-CODE-34B(0-shot)
Mathematical Reasoning	MATH	Parameters (Billions)	34	MMOS-CODE-34B(0-shot)
Mathematical Reasoning	MATH	Accuracy	44.3	MMOS-CODE-7B(0-shot)
Mathematical Reasoning	MATH	Parameters (Billions)	7	MMOS-CODE-7B(0-shot)
Mathematical Reasoning	ASDiv-A	Execution Accuracy	87.6	MMOS-DeepSeekMath-7B(0-shot)
Mathematical Reasoning	ASDiv-A	Execution Accuracy	85.1	MMOS-CODE-34B(0-shot)
Mathematical Reasoning	ASDiv-A	Execution Accuracy	78.6	MMOS-CODE-7B(0-shot)
Mathematical Reasoning	SVAMP	Execution Accuracy	80.6	MMOS-CODE-34B(0-shot)
Mathematical Reasoning	SVAMP	Execution Accuracy	79.3	MMOS-DeepSeekMath-7B(0-shot)
Mathematical Reasoning	SVAMP	Execution Accuracy	76.4	MMOS-CODE-7B(0-shot)
Arithmetic Reasoning	GSM8K	Accuracy	87.2	MMOS-DeepSeekMath-7B(0-shot,k=50)
Arithmetic Reasoning	GSM8K	Parameters (Billions)	7	MMOS-DeepSeekMath-7B(0-shot,k=50)
Arithmetic Reasoning	GSM8K	Accuracy	80.5	MMOS-DeepSeekMath-7B(0-shot)
Arithmetic Reasoning	GSM8K	Parameters (Billions)	7	MMOS-DeepSeekMath-7B(0-shot)
Arithmetic Reasoning	GSM8K	Accuracy	80.4	MMOS-CODE-34B(0-shot)
Arithmetic Reasoning	GSM8K	Parameters (Billions)	34	MMOS-CODE-34B(0-shot)
Arithmetic Reasoning	GSM8K	Accuracy	73.9	MMOS-CODE-7B(0-shot)
Arithmetic Reasoning	GSM8K	Parameters (Billions)	7	MMOS-CODE-7B(0-shot)
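
The table does not define the `k=50` label on some rows. A common reading, assumed here and not confirmed by this page, is that k answers are sampled per problem and the most frequent final answer is scored (majority voting, also called self-consistency). The following is a minimal sketch of that scoring scheme under this assumption; the function names `majority_vote` and `voted_accuracy` are hypothetical and are not taken from the MMOS repository, and answers are compared as plain strings rather than with a math-aware checker.

```python
from collections import Counter
from typing import Iterable, Optional, Sequence


def majority_vote(answers: Iterable[Optional[str]]) -> Optional[str]:
    """Return the most frequent non-empty answer, or None if there is none.

    Ties go to the answer that appeared first among the samples.
    """
    counts = Counter(a.strip() for a in answers if a is not None and a.strip())
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def voted_accuracy(
    samples_per_problem: Sequence[Sequence[Optional[str]]],
    references: Sequence[str],
) -> float:
    """Percentage of problems whose voted answer matches the reference string.

    Assumption: exact string match after stripping whitespace. A real
    evaluation would normalize numbers and expressions before comparing.
    """
    if len(samples_per_problem) != len(references):
        raise ValueError("need one list of samples per reference answer")
    if not references:
        return 0.0
    correct = sum(
        majority_vote(samples) == ref.strip()
        for samples, ref in zip(samples_per_problem, references)
    )
    return 100.0 * correct / len(references)
```

With k=1 this reduces to ordinary single-sample accuracy, which is how the rows without a `k` label would be read under the same assumption.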
