DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving

Yuxuan Tong, Xiwen Zhang, Rui Wang, Ruidong Wu, Junxian He

2024-06-18 | Math, Math Word Problem Solving, Natural Questions, Mathematical Problem-Solving, Arithmetic Reasoning

Abstract

Solving mathematical problems requires advanced reasoning abilities and presents notable challenges for large language models. Previous works usually synthesize data from proprietary models to augment existing datasets, followed by instruction tuning to achieve top-tier results. However, our analysis of these datasets reveals severe biases towards easy queries, with frequent failures to generate any correct response for the most challenging queries. Hypothesizing that difficult queries are crucial to learn complex reasoning, we propose Difficulty-Aware Rejection Tuning (DART), a method that allocates difficult queries more trials during the synthesis phase, enabling more extensive training on difficult samples. Utilizing DART, we have created new datasets for mathematical problem-solving that focus more on difficult queries and are substantially smaller than previous ones. Remarkably, our synthesis process solely relies on a 7B-sized open-weight model, without reliance on the commonly used proprietary GPT-4. We fine-tune various base models on our datasets ranging from 7B to 70B in size, resulting in a series of strong models called DART-MATH. In comprehensive in-domain and out-of-domain evaluation on 6 mathematical benchmarks, DART-MATH outperforms vanilla rejection tuning significantly, being superior or comparable to previous arts, despite using much smaller datasets and no proprietary models. Furthermore, our results position our synthetic datasets as the most effective and cost-efficient publicly available resources for advancing mathematical problem-solving.
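The core idea in the abstract, giving difficult queries more sampling trials so they still end up with correct responses in the training set, can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the deterministic toy sampler, and the per-query targets are all hypothetical. The assumption is that vanilla rejection sampling spends a fixed number of trials on every query, while a difficulty-aware variant keeps sampling each query until it reaches a target number of correct responses (or exhausts a trial cap).

```python
from itertools import count
from typing import Callable, Dict, List, Tuple


def make_toy_sampler(period: Dict[str, int]) -> Callable[[str], bool]:
    """Hypothetical stand-in for 'sample a response and check its answer'.

    For each query, every period[query]-th attempt counts as correct,
    so a larger period means a harder query.
    """
    counters = {q: count(1) for q in period}

    def try_once(query: str) -> bool:
        return next(counters[query]) % period[query] == 0

    return try_once


def vanilla_rejection(
    queries: List[str], try_once: Callable[[str], bool], trials_per_query: int
) -> Dict[str, int]:
    """Same trial budget for every query; returns correct responses kept."""
    return {q: sum(try_once(q) for _ in range(trials_per_query)) for q in queries}


def difficulty_aware_rejection(
    queries: List[str],
    try_once: Callable[[str], bool],
    target_correct: Dict[str, int],
    max_trials: int,
) -> Dict[str, Tuple[int, int]]:
    """Sample each query until its target is met or max_trials is reached.

    Returns (correct responses kept, trials spent) per query.
    """
    stats = {}
    for q in queries:
        correct = trials = 0
        while correct < target_correct[q] and trials < max_trials:
            trials += 1
            correct += try_once(q)
        stats[q] = (correct, trials)
    return stats


if __name__ == "__main__":
    periods = {"easy": 2, "hard": 10}
    queries = list(periods)

    # Fixed budget: the hard query ends up with no correct response at all.
    print(vanilla_rejection(queries, make_toy_sampler(periods), 8))
    # {'easy': 4, 'hard': 0}

    # Equal target per query: the hard query simply receives more trials.
    print(difficulty_aware_rejection(
        queries, make_toy_sampler(periods), {"easy": 4, "hard": 4}, 100))
    # {'easy': (4, 8), 'hard': (4, 40)}

    # Larger target for the harder query (illustrative numbers only).
    print(difficulty_aware_rejection(
        queries, make_toy_sampler(periods), {"easy": 1, "hard": 7}, 100))
    # {'easy': (1, 2), 'hard': (7, 70)}
```

With a fixed budget the hard query contributes nothing to the dataset, which mirrors the bias towards easy queries described above. The two target settings are only loosely inspired by the Uniform and Prop2Diff names in the results; how the paper actually defines those variants should be checked against the paper itself.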

Results

All models are evaluated 0-shot CoT, w/o code.

Dataset: MATH (listed under the tasks Question Answering, Math Word Problem Solving, Mathematical Question Answering, and Mathematical Reasoning with identical values)

| Model | Parameters (Billions) | Accuracy |
|---|---|---|
| DART-Math-Llama3-70B-Prop2Diff | 70 | 56.1 |
| DART-Math-Llama3-70B-Uniform | 70 | 54.9 |
| DART-Math-DSMath-7B-Prop2Diff | 7 | 53.6 |
| DART-Math-DSMath-7B-Uniform | 7 | 52.9 |
| DART-Math-Llama3-8B-Prop2Diff | 8 | 46.6 |
| DART-Math-Mistral-7B-Prop2Diff | 7 | 45.5 |
| DART-Math-Llama3-8B-Uniform | 8 | 45.3 |
| DART-Math-Mistral-7B-Uniform | 7 | 43.5 |

Dataset: GSM8K (task: Arithmetic Reasoning)

| Model | Parameters (Billions) | Accuracy |
|---|---|---|
| DART-Math-Llama3-70B-Uniform | 70 | 90.4 |
| DART-Math-Llama3-70B-Prop2Diff | 70 | 89.6 |
| DART-Math-DSMath-7B-Uniform | 7 | 88.2 |
| DART-Math-DSMath-7B-Prop2Diff | 7 | 86.8 |
| DART-Math-Mistral-7B-Uniform | 7 | 82.6 |
| DART-Math-Llama3-8B-Uniform | 8 | 82.5 |
| DART-Math-Mistral-7B-Prop2Diff | 7 | 81.1 |
| DART-Math-Llama3-8B-Prop2Diff | 8 | 81.1 |

Dataset: TheoremQA (task: General Knowledge)

| Model | Accuracy |
|---|---|
| DART-Math-DSMath-7B-Uniform | 32.5 |
| DART-Math-DSMath-7B-Prop2Diff | 32.2 |
| DART-Math-Llama3-70B-Prop2Diff | 28.2 |
| DART-Math-Llama3-70B-Uniform | 27.4 |
| DART-Math-Llama3-8B-Prop2Diff | 19.4 |
| DART-Math-Mistral-7B-Prop2Diff | 17 |
| DART-Math-Mistral-7B-Uniform | 16.4 |
| DART-Math-Llama3-8B-Uniform | 15.4 |
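
The Prop2Diff and Uniform entries above are easier to compare as per-model differences. The short script below is only a convenience sketch: the accuracy values are copied from the MATH and GSM8K results listed above, while the dictionary layout and the function name `prop2diff_gain` are made up for illustration.

```python
from typing import Dict

# Accuracy values copied from the results above (0-shot CoT, w/o code).
MATH_ACC: Dict[str, Dict[str, float]] = {
    "Llama3-70B": {"Prop2Diff": 56.1, "Uniform": 54.9},
    "DSMath-7B": {"Prop2Diff": 53.6, "Uniform": 52.9},
    "Llama3-8B": {"Prop2Diff": 46.6, "Uniform": 45.3},
    "Mistral-7B": {"Prop2Diff": 45.5, "Uniform": 43.5},
}
GSM8K_ACC: Dict[str, Dict[str, float]] = {
    "Llama3-70B": {"Prop2Diff": 89.6, "Uniform": 90.4},
    "DSMath-7B": {"Prop2Diff": 86.8, "Uniform": 88.2},
    "Llama3-8B": {"Prop2Diff": 81.1, "Uniform": 82.5},
    "Mistral-7B": {"Prop2Diff": 81.1, "Uniform": 82.6},
}


def prop2diff_gain(table: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Accuracy of Prop2Diff minus Uniform for each base model, in points."""
    return {
        model: round(acc["Prop2Diff"] - acc["Uniform"], 1)
        for model, acc in table.items()
    }


if __name__ == "__main__":
    print("MATH :", prop2diff_gain(MATH_ACC))
    # MATH : {'Llama3-70B': 1.2, 'DSMath-7B': 0.7, 'Llama3-8B': 1.3, 'Mistral-7B': 2.0}
    print("GSM8K:", prop2diff_gain(GSM8K_ACC))
    # GSM8K: {'Llama3-70B': -0.8, 'DSMath-7B': -1.4, 'Llama3-8B': -1.4, 'Mistral-7B': -1.5}
```

On these numbers, Prop2Diff is ahead of Uniform on MATH for all four base models, and Uniform is ahead on GSM8K for all four.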

Related Papers

VAR-MATH: Probing True Mathematical Reasoning in Large Language Models via Symbolic Multi-Instance Benchmarks (2025-07-17)
QuestA: Expanding Reasoning Capacity in LLMs via Question Augmentation (2025-07-17)
Scaling Up RL: Unlocking Diverse Reasoning in LLMs via Prolonged Training (2025-07-16)
Temperature and Persona Shape LLM Agent Consensus With Minimal Accuracy Gains in Qualitative Coding (2025-07-15)
Personalized Exercise Recommendation with Semantically-Grounded Knowledge Tracing (2025-07-15)
DCR: Quantifying Data Contamination in LLMs Evaluation (2025-07-15)
Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination (2025-07-14)
A Practical Two-Stage Recipe for Mathematical LLMs: Maximizing Accuracy with SFT and Efficiency with Reinforcement Learning (2025-07-11)