Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, Daya Guo
Mathematical reasoning poses a significant challenge for language models due to its complex and structured nature. In this paper, we introduce DeepSeekMath 7B, which continues pre-training DeepSeek-Coder-Base-v1.5 7B with 120B math-related tokens sourced from Common Crawl, together with natural language and code data. DeepSeekMath 7B achieves 51.7% on the competition-level MATH benchmark without relying on external toolkits or voting techniques, approaching the performance level of Gemini-Ultra and GPT-4. Self-consistency over 64 samples from DeepSeekMath 7B achieves 60.9% on MATH. We attribute the mathematical reasoning capability of DeepSeekMath to two key factors. First, we harness the significant potential of publicly available web data through a meticulously engineered data selection pipeline. Second, we introduce Group Relative Policy Optimization (GRPO), a variant of Proximal Policy Optimization (PPO) that enhances mathematical reasoning abilities while reducing the memory usage of PPO.
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Question Answering | MATH | Accuracy | 58.8 | DeepSeekMATH-RL-7B (w/ code, greedy decoding) |
| Question Answering | MATH | Parameters (Billions) | 7 | DeepSeekMATH-RL-7B (w/ code, greedy decoding) |
| Question Answering | MATH | Accuracy | 51.7 | DeepSeekMATH-RL-7B (greedy decoding) |
| Question Answering | MATH | Parameters (Billions) | 7 | DeepSeekMATH-RL-7B (greedy decoding) |
| Math Word Problem Solving | MATH | Accuracy | 58.8 | DeepSeekMATH-RL-7B (w/ code, greedy decoding) |
| Math Word Problem Solving | MATH | Parameters (Billions) | 7 | DeepSeekMATH-RL-7B (w/ code, greedy decoding) |
| Math Word Problem Solving | MATH | Accuracy | 51.7 | DeepSeekMATH-RL-7B (greedy decoding) |
| Math Word Problem Solving | MATH | Parameters (Billions) | 7 | DeepSeekMATH-RL-7B (greedy decoding) |
| Mathematical Question Answering | MATH | Accuracy | 58.8 | DeepSeekMATH-RL-7B (w/ code, greedy decoding) |
| Mathematical Question Answering | MATH | Parameters (Billions) | 7 | DeepSeekMATH-RL-7B (w/ code, greedy decoding) |
| Mathematical Question Answering | MATH | Accuracy | 51.7 | DeepSeekMATH-RL-7B (greedy decoding) |
| Mathematical Question Answering | MATH | Parameters (Billions) | 7 | DeepSeekMATH-RL-7B (greedy decoding) |
| Mathematical Reasoning | MATH | Accuracy | 58.8 | DeepSeekMATH-RL-7B (w/ code, greedy decoding) |
| Mathematical Reasoning | MATH | Parameters (Billions) | 7 | DeepSeekMATH-RL-7B (w/ code, greedy decoding) |
| Mathematical Reasoning | MATH | Accuracy | 51.7 | DeepSeekMATH-RL-7B (greedy decoding) |
| Mathematical Reasoning | MATH | Parameters (Billions) | 7 | DeepSeekMATH-RL-7B (greedy decoding) |
| Arithmetic Reasoning | GSM8K | Accuracy | 88.2 | DeepSeekMATH-RL-7B |
| Arithmetic Reasoning | GSM8K | Parameters (Billions) | 7 | DeepSeekMATH-RL-7B |
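
The abstract names two techniques that a small example may make more concrete: self-consistency (the 64-sample result on MATH) and the group-relative idea behind GRPO. The sketch below is illustrative only and is not the authors' implementation. It assumes self-consistency means a majority vote over the final answers extracted from sampled completions, and that GRPO scores each sampled output against the mean and standard deviation of the rewards in its own group rather than against a learned value model, which is where the memory saving over PPO would come from. The function names, the `eps` constant, and the use of the population standard deviation are assumptions made for this example.

```python
from collections import Counter
from statistics import mean, pstdev


def self_consistency_vote(answers):
    """Return the most frequent final answer among sampled completions.

    `answers` holds one extracted final answer per sample (hypothetical
    input format; answer extraction itself is not shown here).
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    # Ties are broken in favour of the answer that was seen first.
    return Counter(answers).most_common(1)[0][0]


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise each reward against its own group of samples.

    Assumption: advantage_i = (r_i - mean(r)) / (std(r) + eps), using the
    population standard deviation. `eps` only guards against division by
    zero when every reward in the group is identical.
    """
    if not rewards:
        raise ValueError("need at least one reward")
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled answers to one question; the majority answer is chosen.
    print(self_consistency_vote(["42", "41", "42", "7"]))
    # Rewards for a group of four sampled outputs to the same question.
    print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
```

In this sketch, outputs that score above their group's average receive a positive advantage and those below receive a negative one, so no separate value network is needed to provide a baseline.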