Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Augmenting Math Word Problems via Iterative Question Composing

Haoxiong Liu, Yifan Zhang, Yifan Luo, Andrew Chi-Chih Yao

2024-01-17 · Mathematical Reasoning · Math · Math Word Problem Solving
Paper · PDF · Code (official)

Abstract

Despite the advancements in large language models (LLMs) for mathematical reasoning, solving competition-level math problems remains a significant challenge, especially for open-source LLMs without external tools. We introduce the MMIQC dataset, comprising a mixture of processed web data and synthetic question-response pairs, aimed at enhancing the mathematical reasoning capabilities of base language models. Models fine-tuned on MMIQC consistently surpass their counterparts in performance on the MATH benchmark across various model sizes. Notably, Qwen-72B-MMIQC achieves 45.0% accuracy, exceeding the previous open-source state of the art by 8.2% and outperforming the initial version of GPT-4 released in 2023. Extensive evaluation results on Hungarian high school finals suggest that this improvement generalizes to unseen data. Our ablation study on MMIQC reveals that a large part of the improvement can be attributed to our novel augmentation method, Iterative Question Composing (IQC), which iteratively composes new questions from seed problems using an LLM and applies rejection sampling through another LLM.
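The IQC procedure outlined in the abstract can be sketched as a simple loop: a "composer" LLM derives a new question from each seed problem, a second LLM answers it, and rejection sampling keeps only candidates whose responses pass a check; accepted questions become seeds for the next iteration. The sketch below uses hypothetical stub functions (`compose`, `answer`, `accept`) in place of real LLM calls and an actual answer-verification criterion, which the abstract does not specify:

```python
def iterative_question_composing(seed_problems, compose, answer, accept, iterations=3):
    """Grow a dataset of (question, response) pairs from seed problems.

    compose: composer LLM — writes a new question derived from an existing one
    answer:  second LLM — produces a candidate response to a question
    accept:  rejection-sampling check — keep the pair only if it passes
    """
    dataset = []
    frontier = list(seed_problems)
    for _ in range(iterations):
        next_frontier = []
        for question in frontier:
            new_q = compose(question)      # compose a new question from the seed
            response = answer(new_q)       # sample a response with the other LLM
            if accept(new_q, response):    # rejection sampling: discard failures
                dataset.append((new_q, response))
                next_frontier.append(new_q)  # accepted questions seed the next round
        frontier = next_frontier
    return dataset


# Toy stand-ins; real use would call LLM APIs and a proper answer checker.
compose = lambda q: q + " Now double the result."
answer = lambda q: "42"
accept = lambda q, r: r.strip().isdigit()

augmented = iterative_question_composing(["What is 21 + 21?"], compose, answer, accept,
                                         iterations=2)
```

The key design point is that composition is iterative: each accepted question re-enters the frontier, so question complexity can grow across rounds rather than all augmentations being one step away from the original seeds.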

Results

Task | Dataset | Metric | Value | Model
Question Answering | MATH | Accuracy | 45 | MMIQC-72B
Question Answering | MATH | Parameters (Billions) | 72 | MMIQC-72B
Math Word Problem Solving | MATH | Accuracy | 45 | MMIQC-72B
Math Word Problem Solving | MATH | Parameters (Billions) | 72 | MMIQC-72B
Mathematical Question Answering | MATH | Accuracy | 45 | MMIQC-72B
Mathematical Question Answering | MATH | Parameters (Billions) | 72 | MMIQC-72B
Mathematical Reasoning | MATH | Accuracy | 45 | MMIQC-72B
Mathematical Reasoning | MATH | Parameters (Billions) | 72 | MMIQC-72B

Related Papers

VAR-MATH: Probing True Mathematical Reasoning in Large Language Models via Symbolic Multi-Instance Benchmarks (2025-07-17)
QuestA: Expanding Reasoning Capacity in LLMs via Question Augmentation (2025-07-17)
A Survey of Deep Learning for Geometry Problem Solving (2025-07-16)
Scaling Up RL: Unlocking Diverse Reasoning in LLMs via Prolonged Training (2025-07-16)
KisMATH: Do LLMs Have Knowledge of Implicit Structures in Mathematical Reasoning? (2025-07-15)
Temperature and Persona Shape LLM Agent Consensus With Minimal Accuracy Gains in Qualitative Coding (2025-07-15)
Personalized Exercise Recommendation with Semantically-Grounded Knowledge Tracing (2025-07-15)
Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination (2025-07-14)