Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations

Peiyi Wang, Lei Li, Zhihong Shao, R. X. Xu, Damai Dai, Yifei Li, Deli Chen, Y. Wu, Zhifang Sui

2023-12-14 · Mathematical Reasoning · Math · Reranking · Math Word Problem Solving · GSM8K · Arithmetic Reasoning

Paper · PDF · Code

Abstract

In this paper, we present an innovative process-oriented math process reward model called Math-Shepherd, which assigns a reward score to each step of math problem solutions. The training of Math-Shepherd is achieved using automatically constructed process-wise supervision data, breaking the bottleneck of heavy reliance on manual annotation in existing work. We explore the effectiveness of Math-Shepherd in two scenarios: 1) Verification: Math-Shepherd is utilized for reranking multiple outputs generated by Large Language Models (LLMs); 2) Reinforcement Learning: Math-Shepherd is employed to reinforce LLMs with step-by-step Proximal Policy Optimization (PPO). With Math-Shepherd, a series of open-source LLMs demonstrates exceptional performance. For instance, the step-by-step PPO with Math-Shepherd significantly improves the accuracy of Mistral-7B (77.9% → 84.1% on GSM8K and 28.6% → 33.0% on MATH). The accuracy can be further enhanced to 89.1% and 43.5% on GSM8K and MATH with the verification of Math-Shepherd, respectively. We believe that automatic process supervision holds significant potential for the future evolution of LLMs.
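The verification scenario above amounts to best-of-k selection: sample k candidate solutions, score each step with the process reward model (PRM), aggregate the step scores into a solution score, and keep the highest-scoring candidate. The sketch below illustrates only that selection logic; the step scorer is a toy stand-in (the real Math-Shepherd reward comes from a trained LLM), and the min-over-steps aggregation is an assumption, chosen because a solution is only as strong as its weakest step.

```python
# Minimal sketch of PRM-based best-of-k reranking (the "Verification"
# scenario from the abstract). Assumptions: per-step scores are given
# directly, and solutions are aggregated by the minimum step score --
# the actual Math-Shepherd model and aggregation may differ.

from typing import List


def solution_score(step_scores: List[float]) -> float:
    """Aggregate per-step PRM rewards into a single solution score.

    Uses min over steps (assumed aggregation): one bad step makes the
    whole chain-of-thought suspect.
    """
    return min(step_scores) if step_scores else 0.0


def rerank(candidates: List[List[float]]) -> int:
    """Return the index of the best candidate, where each candidate is
    a list of per-step reward scores for one sampled solution."""
    return max(range(len(candidates)),
               key=lambda i: solution_score(candidates[i]))


# Toy example: three sampled solutions with per-step scores.
candidates = [
    [0.9, 0.8, 0.2],   # strong start, one weak step
    [0.7, 0.7, 0.7],   # uniformly solid
    [0.95, 0.1],       # collapses at the second step
]
print(rerank(candidates))  # 1 -- the uniformly solid solution wins
```

In practice k is large (the results below use k=256), so the cost of verification is dominated by sampling and scoring, not by the selection itself.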

Results

Task | Dataset | Metric | Value | Model
Question Answering | MATH | Accuracy | 48.1 | Shepherd + DeepSeek-67B (SFT on MetaMATH + PRM rerank, k=256)
Question Answering | MATH | Parameters (Billions) | 67 | Shepherd + DeepSeek-67B (SFT on MetaMATH + PRM rerank, k=256)
Question Answering | MATH | Accuracy | 43.5 | Shepherd + Mistral-7B (SFT on MetaMATH + PRM RL + PRM rerank, k=256)
Question Answering | MATH | Parameters (Billions) | 7 | Shepherd + Mistral-7B (SFT on MetaMATH + PRM RL + PRM rerank, k=256)
Question Answering | MATH | Accuracy | 33.0 | Shepherd + Mistral-7B (SFT on MetaMATH + PRM RL)
Question Answering | MATH | Parameters (Billions) | 7 | Shepherd + Mistral-7B (SFT on MetaMATH + PRM RL)
Math Word Problem Solving | MATH | Accuracy | 48.1 | Shepherd + DeepSeek-67B (SFT on MetaMATH + PRM rerank, k=256)
Math Word Problem Solving | MATH | Parameters (Billions) | 67 | Shepherd + DeepSeek-67B (SFT on MetaMATH + PRM rerank, k=256)
Math Word Problem Solving | MATH | Accuracy | 43.5 | Shepherd + Mistral-7B (SFT on MetaMATH + PRM RL + PRM rerank, k=256)
Math Word Problem Solving | MATH | Parameters (Billions) | 7 | Shepherd + Mistral-7B (SFT on MetaMATH + PRM RL + PRM rerank, k=256)
Math Word Problem Solving | MATH | Accuracy | 33.0 | Shepherd + Mistral-7B (SFT on MetaMATH + PRM RL)
Math Word Problem Solving | MATH | Parameters (Billions) | 7 | Shepherd + Mistral-7B (SFT on MetaMATH + PRM RL)
Mathematical Question Answering | MATH | Accuracy | 48.1 | Shepherd + DeepSeek-67B (SFT on MetaMATH + PRM rerank, k=256)
Mathematical Question Answering | MATH | Parameters (Billions) | 67 | Shepherd + DeepSeek-67B (SFT on MetaMATH + PRM rerank, k=256)
Mathematical Question Answering | MATH | Accuracy | 43.5 | Shepherd + Mistral-7B (SFT on MetaMATH + PRM RL + PRM rerank, k=256)
Mathematical Question Answering | MATH | Parameters (Billions) | 7 | Shepherd + Mistral-7B (SFT on MetaMATH + PRM RL + PRM rerank, k=256)
Mathematical Question Answering | MATH | Accuracy | 33.0 | Shepherd + Mistral-7B (SFT on MetaMATH + PRM RL)
Mathematical Question Answering | MATH | Parameters (Billions) | 7 | Shepherd + Mistral-7B (SFT on MetaMATH + PRM RL)
Mathematical Reasoning | MATH | Accuracy | 48.1 | Shepherd + DeepSeek-67B (SFT on MetaMATH + PRM rerank, k=256)
Mathematical Reasoning | MATH | Parameters (Billions) | 67 | Shepherd + DeepSeek-67B (SFT on MetaMATH + PRM rerank, k=256)
Mathematical Reasoning | MATH | Accuracy | 43.5 | Shepherd + Mistral-7B (SFT on MetaMATH + PRM RL + PRM rerank, k=256)
Mathematical Reasoning | MATH | Parameters (Billions) | 7 | Shepherd + Mistral-7B (SFT on MetaMATH + PRM RL + PRM rerank, k=256)
Mathematical Reasoning | MATH | Accuracy | 33.0 | Shepherd + Mistral-7B (SFT on MetaMATH + PRM RL)
Mathematical Reasoning | MATH | Parameters (Billions) | 7 | Shepherd + Mistral-7B (SFT on MetaMATH + PRM RL)
Arithmetic Reasoning | GSM8K | Accuracy | 89.1 | Shepherd + Mistral-7B (SFT on MetaMATH + PRM RL + PRM rerank, k=256)
Arithmetic Reasoning | GSM8K | Parameters (Billions) | 7 | Shepherd + Mistral-7B (SFT on MetaMATH + PRM RL + PRM rerank, k=256)
Arithmetic Reasoning | GSM8K | Accuracy | 84.1 | Shepherd + Mistral-7B (SFT on MetaMATH + PRM RL)
Arithmetic Reasoning | GSM8K | Parameters (Billions) | 7 | Shepherd + Mistral-7B (SFT on MetaMATH + PRM RL)

Related Papers

VAR-MATH: Probing True Mathematical Reasoning in Large Language Models via Symbolic Multi-Instance Benchmarks (2025-07-17)
QuestA: Expanding Reasoning Capacity in LLMs via Question Augmentation (2025-07-17)
GEMMAS: Graph-based Evaluation Metrics for Multi Agent Systems (2025-07-17)
A Survey of Deep Learning for Geometry Problem Solving (2025-07-16)
Scaling Up RL: Unlocking Diverse Reasoning in LLMs via Prolonged Training (2025-07-16)
DAC: A Dynamic Attention-aware Approach for Task-Agnostic Prompt Compression (2025-07-16)
KisMATH: Do LLMs Have Knowledge of Implicit Structures in Mathematical Reasoning? (2025-07-15)
Temperature and Persona Shape LLM Agent Consensus With Minimal Accuracy Gains in Qualitative Coding (2025-07-15)