Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement

An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu, Jingren Zhou, Junyang Lin, Keming Lu, Mingfeng Xue, Runji Lin, Tianyu Liu, Xingzhang Ren, Zhenru Zhang

2024-09-18 | Mathematical Reasoning, Math, Math Word Problem Solving, Philosophy, GSM8K

Abstract

In this report, we present a series of math-specific large language models: Qwen2.5-Math and Qwen2.5-Math-Instruct-1.5B/7B/72B. The core innovation of the Qwen2.5 series lies in integrating the philosophy of self-improvement throughout the entire pipeline, from pre-training and post-training to inference: (1) During the pre-training phase, Qwen2-Math-Instruct is utilized to generate large-scale, high-quality mathematical data. (2) In the post-training phase, we develop a reward model (RM) by conducting massive sampling from Qwen2-Math-Instruct. This RM is then applied to the iterative evolution of data in supervised fine-tuning (SFT). With a stronger SFT model, it's possible to iteratively train and update the RM, which in turn guides the next round of SFT data iteration. On the final SFT model, we employ the ultimate RM for reinforcement learning, resulting in Qwen2.5-Math-Instruct. (3) Furthermore, during the inference stage, the RM is used to guide sampling, optimizing the model's performance. Qwen2.5-Math-Instruct supports both Chinese and English, and possesses advanced mathematical reasoning capabilities, including Chain-of-Thought (CoT) and Tool-Integrated Reasoning (TIR). We evaluate our models on 10 mathematics datasets in both English and Chinese, such as GSM8K, MATH, GaoKao, AMC23, and AIME24, covering a range of difficulties from grade school level to math competition problems.
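
The abstract says the RM guides sampling at inference time but does not spell out the procedure here. One common way to do this is best-of-N selection: draw several candidate solutions and keep the one the reward model scores highest. The sketch below assumes that reading; `best_of_n`, `sample_fn`, `reward_fn`, and the toy stand-ins are hypothetical names for illustration, not the authors' code or any Qwen API.

```python
from typing import Callable, Iterable, List, Tuple


def best_of_n(
    problem: str,
    sample_fn: Callable[[str], str],
    reward_fn: Callable[[str, str], float],
    n: int = 8,
) -> Tuple[str, float]:
    """Draw n candidate solutions and return the highest-scoring one.

    sample_fn stands in for the policy model and reward_fn for the RM;
    both are assumptions made for this sketch.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates: List[str] = [sample_fn(problem) for _ in range(n)]
    scored = [(reward_fn(problem, c), c) for c in candidates]
    # max() keeps the first candidate among ties.
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


def make_toy_sampler(answers: Iterable[str]) -> Callable[[str], str]:
    """Toy policy: replays a fixed sequence of answers, one per call."""
    it = iter(answers)

    def sample(problem: str) -> str:
        return next(it)

    return sample


def toy_reward(problem: str, solution: str) -> float:
    """Toy RM: rewards solutions whose final character is '4'."""
    return 1.0 if solution.strip().endswith("4") else 0.0


if __name__ == "__main__":
    sampler = make_toy_sampler(
        ["The answer is 5", "The answer is 4", "The answer is 3"]
    )
    print(best_of_n("What is 2 + 2?", sampler, toy_reward, n=3))
```

In a real setting `sample_fn` would call the instruct model with a nonzero temperature and `reward_fn` would call the trained RM; the selection logic stays the same.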

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Question Answering | MATH | Accuracy | 88.1 | Qwen2.5-Math-72B-Instruct(TIR,Greedy) |
| Question Answering | MATH | Parameters (Billions) | 72 | Qwen2.5-Math-72B-Instruct(TIR,Greedy) |
| Question Answering | MATH | Accuracy | 85.9 | Qwen2.5-Math-72B-Instruct(COT,Greedy) |
| Question Answering | MATH | Parameters (Billions) | 72 | Qwen2.5-Math-72B-Instruct(COT,Greedy) |
| Question Answering | MATH | Accuracy | 85.2 | Qwen2.5-Math-7B-Instruct(TIR,Greedy) |
| Question Answering | MATH | Parameters (Billions) | 7 | Qwen2.5-Math-7B-Instruct(TIR,Greedy) |
| Question Answering | MATH | Accuracy | 83.6 | Qwen2.5-Math-7B-Instruct(COT,Greedy) |
| Question Answering | MATH | Parameters (Billions) | 7 | Qwen2.5-Math-7B-Instruct(COT,Greedy) |
| Question Answering | MATH | Accuracy | 79.9 | Qwen2.5-Math-1.5B-Instruct(TIR,Greedy) |
| Question Answering | MATH | Parameters (Billions) | 1.5 | Qwen2.5-Math-1.5B-Instruct(TIR,Greedy) |
| Question Answering | MATH | Accuracy | 75.8 | Qwen2.5-Math-1.5B-Instruct(COT,Greedy) |
| Question Answering | MATH | Parameters (Billions) | 1.5 | Qwen2.5-Math-1.5B-Instruct(COT,Greedy) |
| Math Word Problem Solving | MATH | Accuracy | 88.1 | Qwen2.5-Math-72B-Instruct(TIR,Greedy) |
| Math Word Problem Solving | MATH | Parameters (Billions) | 72 | Qwen2.5-Math-72B-Instruct(TIR,Greedy) |
| Math Word Problem Solving | MATH | Accuracy | 85.9 | Qwen2.5-Math-72B-Instruct(COT,Greedy) |
| Math Word Problem Solving | MATH | Parameters (Billions) | 72 | Qwen2.5-Math-72B-Instruct(COT,Greedy) |
| Math Word Problem Solving | MATH | Accuracy | 85.2 | Qwen2.5-Math-7B-Instruct(TIR,Greedy) |
| Math Word Problem Solving | MATH | Parameters (Billions) | 7 | Qwen2.5-Math-7B-Instruct(TIR,Greedy) |
| Math Word Problem Solving | MATH | Accuracy | 83.6 | Qwen2.5-Math-7B-Instruct(COT,Greedy) |
| Math Word Problem Solving | MATH | Parameters (Billions) | 7 | Qwen2.5-Math-7B-Instruct(COT,Greedy) |
| Math Word Problem Solving | MATH | Accuracy | 79.9 | Qwen2.5-Math-1.5B-Instruct(TIR,Greedy) |
| Math Word Problem Solving | MATH | Parameters (Billions) | 1.5 | Qwen2.5-Math-1.5B-Instruct(TIR,Greedy) |
| Math Word Problem Solving | MATH | Accuracy | 75.8 | Qwen2.5-Math-1.5B-Instruct(COT,Greedy) |
| Math Word Problem Solving | MATH | Parameters (Billions) | 1.5 | Qwen2.5-Math-1.5B-Instruct(COT,Greedy) |
| Mathematical Question Answering | MATH | Accuracy | 88.1 | Qwen2.5-Math-72B-Instruct(TIR,Greedy) |
| Mathematical Question Answering | MATH | Parameters (Billions) | 72 | Qwen2.5-Math-72B-Instruct(TIR,Greedy) |
| Mathematical Question Answering | MATH | Accuracy | 85.9 | Qwen2.5-Math-72B-Instruct(COT,Greedy) |
| Mathematical Question Answering | MATH | Parameters (Billions) | 72 | Qwen2.5-Math-72B-Instruct(COT,Greedy) |
| Mathematical Question Answering | MATH | Accuracy | 85.2 | Qwen2.5-Math-7B-Instruct(TIR,Greedy) |
| Mathematical Question Answering | MATH | Parameters (Billions) | 7 | Qwen2.5-Math-7B-Instruct(TIR,Greedy) |
| Mathematical Question Answering | MATH | Accuracy | 83.6 | Qwen2.5-Math-7B-Instruct(COT,Greedy) |
| Mathematical Question Answering | MATH | Parameters (Billions) | 7 | Qwen2.5-Math-7B-Instruct(COT,Greedy) |
| Mathematical Question Answering | MATH | Accuracy | 79.9 | Qwen2.5-Math-1.5B-Instruct(TIR,Greedy) |
| Mathematical Question Answering | MATH | Parameters (Billions) | 1.5 | Qwen2.5-Math-1.5B-Instruct(TIR,Greedy) |
| Mathematical Question Answering | MATH | Accuracy | 75.8 | Qwen2.5-Math-1.5B-Instruct(COT,Greedy) |
| Mathematical Question Answering | MATH | Parameters (Billions) | 1.5 | Qwen2.5-Math-1.5B-Instruct(COT,Greedy) |
| Mathematical Reasoning | AMC23 | Acc | 62.5 | Qwen2.5-Math-7B-instruct |
| Mathematical Reasoning | MATH | Accuracy | 88.1 | Qwen2.5-Math-72B-Instruct(TIR,Greedy) |
| Mathematical Reasoning | MATH | Parameters (Billions) | 72 | Qwen2.5-Math-72B-Instruct(TIR,Greedy) |
| Mathematical Reasoning | MATH | Accuracy | 85.9 | Qwen2.5-Math-72B-Instruct(COT,Greedy) |
| Mathematical Reasoning | MATH | Parameters (Billions) | 72 | Qwen2.5-Math-72B-Instruct(COT,Greedy) |
| Mathematical Reasoning | MATH | Accuracy | 85.2 | Qwen2.5-Math-7B-Instruct(TIR,Greedy) |
| Mathematical Reasoning | MATH | Parameters (Billions) | 7 | Qwen2.5-Math-7B-Instruct(TIR,Greedy) |
| Mathematical Reasoning | MATH | Accuracy | 83.6 | Qwen2.5-Math-7B-Instruct(COT,Greedy) |
| Mathematical Reasoning | MATH | Parameters (Billions) | 7 | Qwen2.5-Math-7B-Instruct(COT,Greedy) |
| Mathematical Reasoning | MATH | Accuracy | 79.9 | Qwen2.5-Math-1.5B-Instruct(TIR,Greedy) |
| Mathematical Reasoning | MATH | Parameters (Billions) | 1.5 | Qwen2.5-Math-1.5B-Instruct(TIR,Greedy) |
| Mathematical Reasoning | MATH | Accuracy | 75.8 | Qwen2.5-Math-1.5B-Instruct(COT,Greedy) |
| Mathematical Reasoning | MATH | Parameters (Billions) | 1.5 | Qwen2.5-Math-1.5B-Instruct(COT,Greedy) |
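
The MATH accuracies listed above repeat the same six greedy-decoding numbers under each task label. As a reading aid, the short sketch below collects them once and computes the gap between TIR and CoT for each model size. The dictionary layout and the `tir_gain` helper are assumptions made for illustration; only the accuracy values come from the results listed above.

```python
# MATH accuracy (greedy decoding) copied from the results above,
# keyed by model size and reasoning mode.
math_accuracy = {
    "72B": {"CoT": 85.9, "TIR": 88.1},
    "7B": {"CoT": 83.6, "TIR": 85.2},
    "1.5B": {"CoT": 75.8, "TIR": 79.9},
}


def tir_gain(scores: dict) -> dict:
    """Return TIR minus CoT accuracy per model size, in percentage points.

    Rounded to one decimal place to hide floating-point noise.
    """
    return {
        size: round(modes["TIR"] - modes["CoT"], 1)
        for size, modes in scores.items()
    }


if __name__ == "__main__":
    for size, gain in tir_gain(math_accuracy).items():
        print(f"{size}: TIR is {gain} points above CoT on MATH")
```

On these numbers the gap is 2.2 points for 72B, 1.6 for 7B, and 4.1 for 1.5B, so the smallest model shows the largest difference between the two modes.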

Related Papers

VAR-MATH: Probing True Mathematical Reasoning in Large Language Models via Symbolic Multi-Instance Benchmarks (2025-07-17)
QuestA: Expanding Reasoning Capacity in LLMs via Question Augmentation (2025-07-17)
GEMMAS: Graph-based Evaluation Metrics for Multi Agent Systems (2025-07-17)
A Survey of Deep Learning for Geometry Problem Solving (2025-07-16)
Scaling Up RL: Unlocking Diverse Reasoning in LLMs via Prolonged Training (2025-07-16)
DAC: A Dynamic Attention-aware Approach for Task-Agnostic Prompt Compression (2025-07-16)
KisMATH: Do LLMs Have Knowledge of Implicit Structures in Mathematical Reasoning? (2025-07-15)
Temperature and Persona Shape LLM Agent Consensus With Minimal Accuracy Gains in Qualitative Coding (2025-07-15)