Let's Verify Step by Step: A Process Supervision Approach to Reward Modeling
What This Paper Did
Imagine you hire a teacher to grade your maths homework. The old way: the teacher looks only at your final answer. If you wrote “42” and it’s correct, full marks — even if all your steps were nonsense. The new way: the teacher checks every step. They mark step 1, step 2, step 3, step 4 independently. If step 3 is wrong, they spot it immediately.
That’s the core insight of “Let’s Verify Step by Step.” OpenAI researchers trained two types of reward models for mathematical reasoning problems:
- Outcome Reward Models (ORM): Judge only the final answer. The training label is binary: 1 if the final answer is correct, 0 if it is wrong.
- Process Reward Models (PRM): Judge each intermediate reasoning step. Score each step given the steps before it, then combine the step scores into one score for the solution.
The paper’s main finding: process reward models are more reliable than outcome reward models at selecting the best solution from multiple candidates. This matters for best-of-N sampling, where an LLM generates many candidate solutions and the reward model picks the “best” one. If the selector is bad, the output is bad.
To prove this, the researchers:
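- Created PRM800K: a dataset of roughly 800,000 human step-level labels, covering about 75,000 solutions to 12,000 competition math problems, with the solutions generated by a GPT-4-based model
- Trained a PRM to predict whether each step is correct (0 = wrong, 1 = correct; annotators could also mark a step as neutral)
- Compared the PRM to an ORM on a held-out set of 500 problems from the MATH test set
- Showed that PRM + best-of-N selection outperforms ORM + best-of-N: the PRM-selected solution was correct on about 78% of test problems, versus about 72% for the ORM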
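To make the dataset description concrete, here is a minimal sketch of what one step-annotated solution could look like in code. The class names, field names, and the example problem are hypothetical and chosen for illustration; they are not the published PRM800K schema, and turning annotations into (prefix, label) pairs this way is an assumption about one reasonable training setup rather than the paper's own code.

```python
from dataclasses import dataclass
from typing import Iterator, List, Optional, Tuple


@dataclass
class Step:
    text: str
    label: int  # 1 = annotator marked the step correct, 0 = marked it wrong


@dataclass
class AnnotatedSolution:
    problem: str
    steps: List[Step]

    def first_error(self) -> Optional[int]:
        """Index of the first step marked wrong, or None if every step passed."""
        for i, step in enumerate(self.steps):
            if step.label == 0:
                return i
        return None

    def training_pairs(self) -> Iterator[Tuple[str, int]]:
        """Yield (problem plus steps 1..i, label of step i) for each step i.

        Assumption: the step scorer sees everything up to and including the
        step it is judging, matching a conditional per-step probability.
        """
        prefix = self.problem
        for step in self.steps:
            prefix = prefix + "\n" + step.text
            yield prefix, step.label


# Hypothetical example: two correct steps followed by a wrong one.
sol = AnnotatedSolution(
    problem="Solve 2x + 6 = 10, then compute x^2.",
    steps=[
        Step("Subtract 6 from both sides: 2x = 4.", 1),
        Step("Divide both sides by 2: x = 2.", 1),
        Step("Square it: x^2 = 2.", 0),  # the slip the annotator flags
    ],
)

print("first wrong step index:", sol.first_error())
for prefix, label in sol.training_pairs():
    print(label, "<-", prefix.splitlines()[-1])
```

An outcome label for this solution would be a single 0. The step labels carry more information: they say the first two steps were fine and locate the error at the third.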
Key equations:
ORM training label: 1 if final_answer(y) is correct, else 0
ORM score: R_ORM(x, y) = P(final_answer(y) is correct | x, y)
PRM score: R_PRM(x, y) = ∏_{i=1}^{T} p_i
where p_i = P(step_i is correct | x, y_{1:i}) and T is the number of steps in y
Best-of-N: i* = argmax_{i ∈ {1..N}} R(x, y_i)
[select the candidate solution with the highest reward]
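The scoring rules above can be sketched in a few lines of Python. This is a toy illustration, not the paper's implementation: the `Candidate` class, the function names, and every score are made up, and the reward models themselves are replaced by hard-coded numbers. The `prm_min` function shows the minimum-score aggregation that section 04 covers as an alternative to the product.

```python
import math
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    """One sampled solution. All numbers are mock values, not model outputs."""
    text: str
    orm_score: float         # a single score for the whole solution
    step_probs: List[float]  # p_i for each step, as a process model might emit


def orm(c: Candidate) -> float:
    """Outcome-style reward: one number per solution."""
    return c.orm_score


def prm_product(c: Candidate) -> float:
    """Process-style reward: product of the per-step probabilities."""
    return math.prod(c.step_probs)


def prm_min(c: Candidate) -> float:
    """Alternative aggregation: the weakest step decides the score."""
    return min(c.step_probs)


def best_of_n(cands: List[Candidate],
              reward: Callable[[Candidate], float]) -> Candidate:
    """Return the candidate with the highest reward (the argmax)."""
    return max(cands, key=reward)


candidates = [
    Candidate("sound reasoning", orm_score=0.70,
              step_probs=[0.95, 0.90, 0.95, 0.90]),
    Candidate("lucky guess", orm_score=0.80,
              step_probs=[0.90, 0.80, 0.10, 0.70]),
]

for name, reward in [("ORM", orm), ("PRM product", prm_product),
                     ("PRM min", prm_min)]:
    print(f"{name:12s} picks: {best_of_n(candidates, reward).text}")
```

With these mock scores the outcome-style reward prefers the lucky guess, while both process-style aggregations prefer the solution whose steps all look sound. Passing the reward as a function keeps `best_of_n` identical for every scorer, which is the sense in which the selection rule stays fixed while the reward model changes.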
The key observation: ORM is noisy. A wrong answer can result from mostly correct reasoning (unlucky algebra error at the end). A right answer can result from completely invalid reasoning (lucky coincidence). PRM captures the quality of reasoning directly, because it judges each step.
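A small worked example may help here. The step probabilities below are invented for illustration and do not come from the paper; they only show how a product of step scores behaves in the two cases just described.

```latex
% Illustrative step scores only; these are not values from the paper.
% Solution A: sound method, slip in the last step, wrong final answer.
% Solution B: invalid step 2, right final answer by coincidence.
\begin{align*}
p^{A} &= (0.95,\ 0.95,\ 0.90,\ 0.20), &
R_{\text{PRM}}^{A} &= 0.95 \times 0.95 \times 0.90 \times 0.20 = 0.16245 \\
p^{B} &= (0.90,\ 0.10,\ 0.50,\ 0.60), &
R_{\text{PRM}}^{B} &= 0.90 \times 0.10 \times 0.50 \times 0.60 = 0.027
\end{align*}
```

An outcome label would mark A as 0 and B as 1, with no hint of where either went wrong. The step scores point at step 4 in A and step 2 in B, and under these assumed numbers the product ranks A above B.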
The Indian Analogy
For ORM: Your school entrance exam has only one number to grade: your final score out of 100. A student who writes down “42 is the answer” and it happens to be correct gets full marks, even though their test paper shows they didn’t understand anything. Another student who understood the entire solution method but made one arithmetic error at the very end fails. The exam doesn’t reward thinking — only results.
For PRM: Now imagine a teacher who goes through your working line by line. “Step 1: correct. Step 2: correct. Step 3: wrong — you made a sign error. Step 4: would have been correct if step 3 were right.” The teacher understands your reasoning process. A lucky guess gets no credit. Good reasoning gets credit even if one small step derailed you.
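Why this matters in India: objective entrance exams such as JEE Main and NEET mark only the option you select, so a lucky guess and a sound derivation earn the same marks. Board exams with step-wise marking reward the process: correct working earns marks even when the final answer is wrong. Reward models face the same choice. An LLM generates a batch of candidate solutions; a selector that looks only at the answer can be fooled by a lucky guess, while a selector that checks the steps favours sound reasoning.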
Read in This Order
| Section | What You Will Learn | Difficulty | Time |
|---|---|---|---|
| 01 - Context | Why RLHF reward models are hard to train; the failure of outcome supervision | Beginner | 5 min |
| 02 - The Problem | Concrete example: how ORM picks wrong solutions; data-efficiency issues | Intermediate | 5 min |
| 03 - The Idea | What is process supervision; the step-level annotation scheme; intuition | Intermediate | 8 min |
| 04 - The Math | Mathematical formulation of ORM vs. PRM; product and minimum scores | Intermediate | 8 min |
| 05 - Worked Example | Trace a 4-step solution; compute per-step scores, ORM, PRM, min scores | Intermediate | 6 min |
| 06 - The Code | Implement best-of-N selection; compare ORM and PRM on mock solutions | Beginner | 5 min |
| 07 - Limitations | Data cost, domain specificity, credit assignment, step definition | Advanced | 4 min |
| 08 - Impact | How this influenced OpenAI o1, AlphaProof, test-time compute | Intermediate | 3 min |
| 09 - Summary | One-line recap, main ideas, key numbers, what comes next | Beginner | 1 min |
Before You Read: Math and AI Concepts You’ll Need
- Rewards and Scoring: Cross-Entropy Loss — foundation for understanding how we assign feedback to model outputs
- Chain-of-Thought (Paper 14): The reasoning steps come from LLMs using CoT; PRMs evaluate these steps
- RLHF (Paper 15): Reward models are the core component of RLHF; this paper improves them
- Probability: Basic probability notation and conditional probability P(A|B)
Visual Overview
The pipeline, from top to bottom:
- Math Problem
- GPT-4 generates N solutions; human verifiers annotate each step in each solution
- PRM800K Dataset (800K step annotations)
- Train Reward Model: ORM (one score per solution) or PRM (per-step scores p1, p2, ..., pT)
- Best-of-N Selection
- Best Solution
- Final Answer