Peiyi Wang, Lei Li, Zhihong Shao, R. X. Xu, Damai Dai, Yifei Li, Deli Chen, Y. Wu, Zhifang Sui
In this paper, we present a process reward model for mathematical reasoning called \textbf{Math-Shepherd}, which assigns a reward score to each step of a math problem solution. Math-Shepherd is trained on automatically constructed process-wise supervision data, removing the heavy reliance on manual annotation that bottlenecks existing work. We explore the effectiveness of Math-Shepherd in two scenarios: 1) \textit{Verification}: Math-Shepherd reranks multiple outputs generated by Large Language Models (LLMs); 2) \textit{Reinforcement Learning}: Math-Shepherd is used to reinforce LLMs with step-by-step Proximal Policy Optimization (PPO). With Math-Shepherd, a series of open-source LLMs demonstrates strong performance. For instance, step-by-step PPO with Math-Shepherd significantly improves the accuracy of Mistral-7B (77.9\%$\to$84.1\% on GSM8K and 28.6\%$\to$33.0\% on MATH). Adding Math-Shepherd verification on top further raises accuracy to 89.1\% on GSM8K and 43.5\% on MATH. We believe that automatic process supervision holds significant potential for the future evolution of LLMs.
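The verification scenario can be illustrated with a minimal sketch. The abstract does not say how per-step rewards are combined into a single solution score, so taking the minimum step score is an assumption made here for illustration. `score_steps` is a hypothetical stand-in for a process reward model such as Math-Shepherd, and `toy_score_steps` is a made-up scorer that only exists to keep the example runnable.

```python
from typing import Callable, List, Sequence

# Hypothetical interface: a process reward model maps a list of solution
# steps to one reward per step. This is NOT the paper's actual API.
StepScorer = Callable[[Sequence[str]], Sequence[float]]


def solution_score(step_scores: Sequence[float]) -> float:
    """Collapse per-step rewards into one score.

    Assumption: a solution is only as good as its weakest step, so the
    minimum is used. Other aggregations (e.g. a product) are possible.
    """
    if not step_scores:
        raise ValueError("a solution needs at least one scored step")
    return min(step_scores)


def rerank(candidates: List[List[str]], score_steps: StepScorer) -> List[str]:
    """Return the candidate solution with the highest aggregated score."""
    if not candidates:
        raise ValueError("no candidates to rerank")
    return max(candidates, key=lambda steps: solution_score(score_steps(steps)))


def toy_score_steps(steps: Sequence[str]) -> List[float]:
    """Made-up scorer: steps flagged with 'wrong' get a low reward."""
    return [0.1 if "wrong" in step else 0.9 for step in steps]


if __name__ == "__main__":
    sampled = [
        ["2 + 2 = 4", "wrong: 4 * 3 = 13"],
        ["2 + 2 = 4", "4 * 3 = 12"],
    ]
    best = rerank(sampled, toy_score_steps)
    print(best)
```

In practice the candidate list would hold the k sampled outputs of the generator (k=256 in the results table), and the toy scorer would be replaced by the trained reward model.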
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Question Answering | MATH | Accuracy | 48.1 | Shepherd + DeepSeek-67B (SFT on MetaMATH + PRM rerank, k=256) |
| Question Answering | MATH | Parameters (Billions) | 67 | Shepherd + DeepSeek-67B (SFT on MetaMATH + PRM rerank, k=256) |
| Question Answering | MATH | Accuracy | 43.5 | Shepherd + Mistral-7B (SFT on MetaMATH + PRM RL + PRM rerank, k=256) |
| Question Answering | MATH | Parameters (Billions) | 7 | Shepherd + Mistral-7B (SFT on MetaMATH + PRM RL + PRM rerank, k=256) |
| Question Answering | MATH | Accuracy | 33.0 | Shepherd + Mistral-7B (SFT on MetaMATH + PRM RL) |
| Question Answering | MATH | Parameters (Billions) | 7 | Shepherd + Mistral-7B (SFT on MetaMATH + PRM RL) |
| Math Word Problem Solving | MATH | Accuracy | 48.1 | Shepherd + DeepSeek-67B (SFT on MetaMATH + PRM rerank, k=256) |
| Math Word Problem Solving | MATH | Parameters (Billions) | 67 | Shepherd + DeepSeek-67B (SFT on MetaMATH + PRM rerank, k=256) |
| Math Word Problem Solving | MATH | Accuracy | 43.5 | Shepherd + Mistral-7B (SFT on MetaMATH + PRM RL + PRM rerank, k=256) |
| Math Word Problem Solving | MATH | Parameters (Billions) | 7 | Shepherd + Mistral-7B (SFT on MetaMATH + PRM RL + PRM rerank, k=256) |
| Math Word Problem Solving | MATH | Accuracy | 33.0 | Shepherd + Mistral-7B (SFT on MetaMATH + PRM RL) |
| Math Word Problem Solving | MATH | Parameters (Billions) | 7 | Shepherd + Mistral-7B (SFT on MetaMATH + PRM RL) |
| Mathematical Question Answering | MATH | Accuracy | 48.1 | Shepherd + DeepSeek-67B (SFT on MetaMATH + PRM rerank, k=256) |
| Mathematical Question Answering | MATH | Parameters (Billions) | 67 | Shepherd + DeepSeek-67B (SFT on MetaMATH + PRM rerank, k=256) |
| Mathematical Question Answering | MATH | Accuracy | 43.5 | Shepherd + Mistral-7B (SFT on MetaMATH + PRM RL + PRM rerank, k=256) |
| Mathematical Question Answering | MATH | Parameters (Billions) | 7 | Shepherd + Mistral-7B (SFT on MetaMATH + PRM RL + PRM rerank, k=256) |
| Mathematical Question Answering | MATH | Accuracy | 33.0 | Shepherd + Mistral-7B (SFT on MetaMATH + PRM RL) |
| Mathematical Question Answering | MATH | Parameters (Billions) | 7 | Shepherd + Mistral-7B (SFT on MetaMATH + PRM RL) |
| Mathematical Reasoning | MATH | Accuracy | 48.1 | Shepherd + DeepSeek-67B (SFT on MetaMATH + PRM rerank, k=256) |
| Mathematical Reasoning | MATH | Parameters (Billions) | 67 | Shepherd + DeepSeek-67B (SFT on MetaMATH + PRM rerank, k=256) |
| Mathematical Reasoning | MATH | Accuracy | 43.5 | Shepherd + Mistral-7B (SFT on MetaMATH + PRM RL + PRM rerank, k=256) |
| Mathematical Reasoning | MATH | Parameters (Billions) | 7 | Shepherd + Mistral-7B (SFT on MetaMATH + PRM RL + PRM rerank, k=256) |
| Mathematical Reasoning | MATH | Accuracy | 33.0 | Shepherd + Mistral-7B (SFT on MetaMATH + PRM RL) |
| Mathematical Reasoning | MATH | Parameters (Billions) | 7 | Shepherd + Mistral-7B (SFT on MetaMATH + PRM RL) |
| Arithmetic Reasoning | GSM8K | Accuracy | 89.1 | Shepherd + Mistral-7B (SFT on MetaMATH + PRM RL + PRM rerank, k=256) |
| Arithmetic Reasoning | GSM8K | Parameters (Billions) | 7 | Shepherd + Mistral-7B (SFT on MetaMATH + PRM RL + PRM rerank, k=256) |
| Arithmetic Reasoning | GSM8K | Accuracy | 84.1 | Shepherd + Mistral-7B (SFT on MetaMATH + PRM RL) |
| Arithmetic Reasoning | GSM8K | Parameters (Billions) | 7 | Shepherd + Mistral-7B (SFT on MetaMATH + PRM RL) |