Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

MuggleMath: Assessing the Impact of Query and Response Augmentation on Math Reasoning

Chengpeng Li, Zheng Yuan, Hongyi Yuan, Guanting Dong, Keming Lu, Jiancan Wu, Chuanqi Tan, Xiang Wang, Chang Zhou

2023-10-09
Tags: Mathematical Reasoning, Math, Math Word Problem Solving, Data Augmentation, GSM8K, Arithmetic Reasoning

Abstract

In math reasoning with large language models (LLMs), augmenting fine-tuning data through query evolution and diverse reasoning paths has been empirically shown to be effective, substantially narrowing the gap between open-source LLMs and cutting-edge proprietary LLMs. In this paper, we investigate such data augmentation in math reasoning and aim to answer: (1) Which data augmentation strategies are more effective? (2) What is the scaling relationship between the amount of augmented data and model performance? (3) Can data augmentation incentivize generalization to out-of-domain mathematical reasoning tasks? To this end, we create two new datasets, AugGSM8K and AugMATH, by complicating and diversifying the queries and sampling multiple reasoning paths from GSM8K and MATH. We obtain a series of LLMs called MuggleMath by fine-tuning LLaMA models on AugGSM8K and AugMATH. MuggleMath achieves a new state of the art on GSM8K and MATH. We observe a log-linear relationship between MuggleMath's performance and the amount of augmented data on GSM8K, and a segmented log-linear relationship on MATH. We also find that out-of-domain generalization is weak, both from AugGSM8K to MATH and from AugMATH to GSM8K, which suggests that augmenting queries that cover a broader range of subjects is more beneficial for generalization. We release our code and augmented data at https://github.com/OFA-Sys/gsm8k-ScRel.
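To make the log-linear scaling claim concrete, the sketch below fits accuracy = a + b * log10(N), where N is the number of augmented training examples. It is only an illustration of what such a fit looks like: the function name `fit_log_linear`, the coefficients, and all data points are hypothetical and synthetic, not values from the paper, and the paper's own fitting procedure may differ.

```python
import math

def fit_log_linear(sizes, accuracies):
    """Ordinary least-squares fit of accuracy = a + b * log10(size).

    Hypothetical helper for illustration; not taken from the paper's code.
    """
    xs = [math.log10(n) for n in sizes]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(accuracies) / count
    # Slope: covariance of (x, y) divided by variance of x.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, accuracies))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

# SYNTHETIC data generated from a known line (a=10, b=12), NOT paper results.
sizes = [1_000, 10_000, 100_000, 1_000_000]
accuracies = [10.0 + 12.0 * math.log10(n) for n in sizes]

a, b = fit_log_linear(sizes, accuracies)
print(f"intercept a = {a:.3f}, slope b = {b:.3f} accuracy points per 10x data")
```

Under this form, every tenfold increase in augmented data adds a constant b accuracy points. A segmented log-linear relationship, as reported for MATH, would correspond to fitting separate (a, b) pairs on different ranges of N.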

Results

Task | Dataset | Metric | Value | Model
Question Answering | MATH | Accuracy | 35.6 | MuggleMATH-70B
Question Answering | MATH | Parameters (Billions) | 70 | MuggleMATH-70B
Question Answering | MATH | Accuracy | 30.7 | MuggleMATH-13B
Question Answering | MATH | Parameters (Billions) | 13 | MuggleMATH-13B
Question Answering | MATH | Accuracy | 25.8 | MuggleMATH 7B
Question Answering | MATH | Parameters (Billions) | 7 | MuggleMATH 7B
Math Word Problem Solving | MATH | Accuracy | 35.6 | MuggleMATH-70B
Math Word Problem Solving | MATH | Parameters (Billions) | 70 | MuggleMATH-70B
Math Word Problem Solving | MATH | Accuracy | 30.7 | MuggleMATH-13B
Math Word Problem Solving | MATH | Parameters (Billions) | 13 | MuggleMATH-13B
Math Word Problem Solving | MATH | Accuracy | 25.8 | MuggleMATH 7B
Math Word Problem Solving | MATH | Parameters (Billions) | 7 | MuggleMATH 7B
Mathematical Question Answering | MATH | Accuracy | 35.6 | MuggleMATH-70B
Mathematical Question Answering | MATH | Parameters (Billions) | 70 | MuggleMATH-70B
Mathematical Question Answering | MATH | Accuracy | 30.7 | MuggleMATH-13B
Mathematical Question Answering | MATH | Parameters (Billions) | 13 | MuggleMATH-13B
Mathematical Question Answering | MATH | Accuracy | 25.8 | MuggleMATH 7B
Mathematical Question Answering | MATH | Parameters (Billions) | 7 | MuggleMATH 7B
Mathematical Reasoning | MATH | Accuracy | 35.6 | MuggleMATH-70B
Mathematical Reasoning | MATH | Parameters (Billions) | 70 | MuggleMATH-70B
Mathematical Reasoning | MATH | Accuracy | 30.7 | MuggleMATH-13B
Mathematical Reasoning | MATH | Parameters (Billions) | 13 | MuggleMATH-13B
Mathematical Reasoning | MATH | Accuracy | 25.8 | MuggleMATH 7B
Mathematical Reasoning | MATH | Parameters (Billions) | 7 | MuggleMATH 7B
Arithmetic Reasoning | GSM8K | Accuracy | 82.3 | MuggleMATH 70B
Arithmetic Reasoning | GSM8K | Parameters (Billion) | 70 | MuggleMATH 70B
Arithmetic Reasoning | GSM8K | Accuracy | 74 | MuggleMATH 13B
Arithmetic Reasoning | GSM8K | Parameters (Billion) | 13 | MuggleMATH 13B
Arithmetic Reasoning | GSM8K | Accuracy | 69.8 | MuggleMATH 7B
Arithmetic Reasoning | GSM8K | Parameters (Billion) | 7 | MuggleMATH 7B

Related Papers

VAR-MATH: Probing True Mathematical Reasoning in Large Language Models via Symbolic Multi-Instance Benchmarks (2025-07-17)
QuestA: Expanding Reasoning Capacity in LLMs via Question Augmentation (2025-07-17)
Overview of the TalentCLEF 2025: Skill and Job Title Intelligence for Human Capital Management (2025-07-17)
Pixel Perfect MegaMed: A Megapixel-Scale Vision-Language Foundation Model for Generating High Resolution Medical Images (2025-07-17)
GEMMAS: Graph-based Evaluation Metrics for Multi Agent Systems (2025-07-17)
A Survey of Deep Learning for Geometry Problem Solving (2025-07-16)
Scaling Up RL: Unlocking Diverse Reasoning in LLMs via Prolonged Training (2025-07-16)
Similarity-Guided Diffusion for Contrastive Sequential Recommendation (2025-07-16)