Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

DART

Difficulty-Aware Rejection Tuning

Natural Language Processing | Introduced 2000 | 32 papers

Description

🎯 DART-Math

Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving

📝 Paper@arXiv | 🤗 Datasets&Models@HF | 🐱 Code@GitHub

🐦 Thread@X(Twitter) | 🐶 Chinese Blog@Zhihu | 📊 Leaderboard@PapersWithCode | 📑 BibTeX

Datasets: DART-Math

DART-Math datasets are state-of-the-art, data-efficient, open-source instruction-tuning datasets for mathematical reasoning.

DART-Math-Hard contains ~585k mathematical QA pairs constructed by applying DARS-Prop2Diff to the query set from the MATH and GSM8K training sets, and achieves SOTA on many challenging mathematical reasoning benchmarks. It introduces a deliberate bias towards hard queries, the opposite of vanilla rejection sampling.

Models trained on DART-Math-Hard usually, but not always, perform slightly better (~1% absolute) than those trained on DART-Math-Uniform, which contains ~591k samples constructed by applying DARS-Uniform.

Comparison between Mathematical Instruction Tuning Datasets

Most previous datasets are constructed with ChatGPT, and many of them are not open-source, especially the best-performing ones.

| Math SFT Dataset | # of Samples | MATH | GSM8K | College | Synthesis Agent(s) | Open-Source |
| :---------------- | -----------: | -------: | -------: | -------: | :------------------ | :---------: |
| WizardMath | 96k | 32.3 | 80.4 | 23.1 | GPT-4 | ✗ |
| MetaMathQA | 395k | 29.8 | 76.5 | 19.3 | GPT-3.5 | ✓ |
| MMIQC | 2294k | 37.4 | 75.4 | _28.5_ | GPT-4+GPT-3.5+Human | ✓ |
| Orca-Math | 200k | -- | -- | -- | GPT-4 | ✓ |
| Xwin-Math-V1.1 | 1440k | _45.5_ | **84.9** | 27.6 | GPT-4 | ✗ |
| KPMath-Plus | 1576k | **46.8** | 82.1 | -- | GPT-4 | ✗ |
| MathScaleQA | 2021k | 35.2 | 74.8 | 21.8 | GPT-3.5+Human | ✗ |
| DART-Math-Uniform | 591k | 43.5 | _82.6_ | 26.9 | DeepSeekMath-7B-RL | ✓ |
| DART-Math-Hard | 585k | _45.5_ | 81.1 | **29.4** | DeepSeekMath-7B-RL | ✓ |

<sup>MATH and GSM8K are in-domain, while College(Math) is out-of-domain. Scores are for models fine-tuned from Mistral-7B, except for Xwin-Math-V1.1, which is based on Llama2-7B. Bold/italic marks the best/second-best score in each column.</sup>

Dataset Construction: DARS - Difficulty-Aware Rejection Sampling

Previous works usually synthesize data from proprietary models to augment existing datasets, followed by instruction tuning to achieve top-tier results. However, our analysis of these datasets reveals severe biases towards easy queries, with frequent failures to generate any correct response for the most challenging queries.
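The source of this bias can be illustrated with a toy simulation. With a fixed budget of samples per query, the expected number of kept (correct) responses is the budget times the query's pass rate, so easy queries dominate the resulting dataset. The sketch below is only an illustration: the pass rates are hypothetical, and a Bernoulli draw stands in for sampling a model and checking its answer.

```python
import random


def vanilla_rejection_sampling(pass_rates, n_samples, seed=0):
    """Sample a fixed number of responses per query and keep the correct ones.

    pass_rates: hypothetical per-query probability that one sampled
                response is correct (stand-in for model + answer checker).
    Returns the number of correct responses kept for each query.
    """
    rng = random.Random(seed)
    kept = {}
    for query, p in pass_rates.items():
        # Each draw simulates one sampled response being judged correct or not.
        kept[query] = sum(rng.random() < p for _ in range(n_samples))
    return kept


# Hypothetical pass rates for three queries of increasing difficulty.
toy_pass_rates = {"easy": 0.9, "medium": 0.5, "hard": 0.02}
kept = vanilla_rejection_sampling(toy_pass_rates, n_samples=100)
print(kept)  # the easy query keeps far more responses than the hard one
```

With a small enough budget, a query with a very low pass rate may end up with no correct response at all, which matches the failure mode described above.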

Motivated by the observation above, we propose Difficulty-Aware Rejection Sampling (DARS) to collect more responses for more difficult queries. Specifically, we introduce two strategies to increase the number of correct responses for difficult queries:

  1. Uniform, which involves sampling responses for each query until each query accumulates $k_u$ correct responses, where $k_u$ is a preset hyperparameter determined by the desired size of the synthetic dataset;
  2. Prop2Diff, where we continue sampling responses until the number of correct responses for each query is proportional to its difficulty score. The most challenging queries receive $k_p$ responses, where $k_p$ is a hyperparameter. This method introduces a deliberate bias in the opposite direction to vanilla rejection sampling, towards more difficult queries, inspired by previous works demonstrating that difficult samples can be more effective at enhancing model capabilities (Sorscher et al., 2022; Liu et al., 2024b).
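The two strategies can be sketched as a single sampling loop with different per-query targets. This is a minimal toy sketch, not the authors' implementation. Several parts are assumptions made for illustration: the function `dars` and its parameters are hypothetical, a Bernoulli draw with a per-query pass rate stands in for model sampling and answer checking, difficulty is taken to be the fail rate (1 minus the pass rate), every query gets a target of at least one response, and `max_trials` caps the sampling effort.

```python
import random


def dars(pass_rates, k, strategy="uniform", max_trials=10_000, seed=0):
    """Toy DARS: sample until each query reaches its target of correct responses.

    pass_rates: hypothetical per-query probability of a correct sample.
    k: plays the role of k_u for "uniform", or of k_p (the target for the
       hardest query) for "prop2diff".
    """
    rng = random.Random(seed)
    # Assumption: difficulty score = fail rate of the query.
    difficulty = {q: 1.0 - p for q, p in pass_rates.items()}
    max_diff = max(difficulty.values())

    collected = {}
    for query, p in pass_rates.items():
        if strategy == "uniform":
            target = k
        elif strategy == "prop2diff":
            # Target proportional to difficulty, scaled so the hardest query
            # gets k; floor of 1 so easy queries are not dropped (assumption).
            share = difficulty[query] / max_diff if max_diff > 0 else 0.0
            target = max(1, round(k * share))
        else:
            raise ValueError(f"unknown strategy: {strategy}")

        correct = trials = 0
        # Keep sampling until the target is met or the trial cap is hit.
        while correct < target and trials < max_trials:
            trials += 1
            if rng.random() < p:
                correct += 1
        collected[query] = correct
    return collected


toy_pass_rates = {"easy": 0.9, "medium": 0.5, "hard": 0.05}
uniform_counts = dars(toy_pass_rates, k=4, strategy="uniform")
prop2diff_counts = dars(toy_pass_rates, k=8, strategy="prop2diff")
print(uniform_counts)
print(prop2diff_counts)
```

With these toy pass rates, the Uniform targets are 4 for every query, while the Prop2Diff targets come to 1, 4, and 8 for the easy, medium, and hard queries, respectively. The hard query therefore contributes the most responses, which is the reverse of what a fixed per-query sampling budget would produce.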

See Figure 1 (Right) for examples of DART-Math-Uniform by DARS-Uniform and DART-Math-Hard by DARS-Prop2Diff.

Citation

If you find our data, model or code useful for your work, please kindly cite our paper:

@article{tong2024dartmath,
  title={DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving},
  author={Yuxuan Tong and Xiwen Zhang and Rui Wang and Ruidong Wu and Junxian He},
  year={2024},
  eprint={2407.13690},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2407.13690},
}

Papers Using This Method

- DART: Distilling Autoregressive Reasoning to Silent Thought (2025-06-13)
- DART: Differentiable Dynamic Adaptive Region Tokenizer for Vision Transformer and Mamba (2025-06-12)
- TablePilot: Recommending Human-Preferred Tabular Data Analysis with Large Language Models (2025-03-17)
- dARt Vinci: Egocentric Data Collection for Surgical Robot Learning at Scale (2025-03-07)
- Stop Looking for Important Tokens in Multimodal Language Models: Duplication Matters More (2025-02-17)
- Text-Diffusion Red-Teaming of Large Language Models: Unveiling Harmful Behaviors with Proximity Constraints (2025-01-14)
- Iterative Encoding-Decoding VAEs Anomaly Detection in NOAA's DART Time Series: A Machine Learning Approach for Enhancing Data Integrity for NASA's GRACE-FO Verification and Validation (2024-12-20)
- From Point to probabilistic gradient boosting for claim frequency and severity prediction (2024-12-19)
- DART: An AIGT Detector using AMR of Rephrased Text (2024-12-16)
- An Extensive Evaluation of Factual Consistency in Large Language Models for Data-to-Text Generation (2024-11-28)
- Label Distribution Shift-Aware Prediction Refinement for Test-Time Adaptation (2024-11-20)
- Jal Anveshak: Prediction of fishing zones using fine-tuned LlaMa 2 (2024-11-15)
- DART: Denoising Autoregressive Transformer for Scalable Text-to-Image Generation (2024-10-10)
- DartControl: A Diffusion-Based Autoregressive Motion Model for Real-Time Text-Driven Motion Control (2024-10-07)
- DiSK: Differentially Private Optimizer with Simplified Kalman Filter for Noise Reduction (2024-10-04)
- Impact of Model Size on Fine-tuned LLM Performance in Data-to-Text Generation: A State-of-the-Art Investigation (2024-07-19)
- DART: An Automated End-to-End Object Detection Pipeline with Data Diversification, Open-Vocabulary Bounding Box Annotation, Pseudo-Label Review, and Model Training (2024-07-12)
- Automated Progressive Red Teaming (2024-07-04)
- DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving (2024-06-18)
- Learning to Play Atari in a World of Tokens (2024-06-03)