Let's Verify Step by Step: A Process Supervision Approach to Reward Modeling

What This Paper Did

Imagine you hire a teacher to grade your maths homework. The old way: the teacher looks only at your final answer. If you wrote “42” and it’s correct, full marks — even if all your steps were nonsense. The new way: the teacher checks every step. They mark step 1, step 2, step 3, step 4 independently. If step 3 is wrong, they spot it immediately.

That’s the core insight of “Let’s Verify Step by Step.” OpenAI researchers trained two types of reward models for mathematical reasoning problems:

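Outcome Reward Models (ORM): Trained only on whether the final answer is correct (label 1 if correct, 0 if wrong). At test time the ORM outputs a single predicted probability that the whole solution is correct.
Process Reward Models (PRM): Trained on a human label for each intermediate reasoning step. At test time the PRM scores every step given the steps before it, then combines the step scores into one solution score.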

The paper’s main finding: process supervision yields a much more reliable reward model than outcome supervision when the task is picking the best solution out of many candidates. This matters in best-of-N sampling, where the generator produces N candidate solutions and the reward model picks the one it scores highest. If the selector is bad, the output is bad.

To prove this, the researchers:

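Created PRM800K: a dataset of about 800,000 human step-level labels on solutions that a GPT-4-based generator wrote for problems from the MATH dataset
Trained a PRM to predict each step’s label (annotators marked steps as positive, negative, or neutral)
Compared PRM and ORM on a held-out set of 500 problems from the MATH test set
Showed that PRM + best-of-N selection outperforms ORM + best-of-N, with the gap widening as N grows (the paper reports roughly 78% versus 72% of problems solved at the largest N tested)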

Key equations:

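ORM label:   z(x, y) = 1 if final_answer(y) is correct, else 0
             [training target only; the true answer is unknown at test time]
ORM score:   R_ORM(x, y) = P(final_answer(y) is correct | x, y)

PRM score:   R_PRM(x, y) = ∏_{i=1}^{T} p_i
             where p_i = P(step_i is correct | x, y_{1:i})

Best-of-N:   y* = argmax_{j∈1..N} R(x, y_j)
             [select the candidate solution with the highest reward]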
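
The PRM score and the best-of-N rule above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the function names `prm_score` and `best_of_n` are hypothetical, and the per-step probabilities are made-up numbers standing in for the output of a trained PRM.

```python
from math import prod
from typing import Callable, Sequence


def prm_score(step_probs: Sequence[float]) -> float:
    """Solution-level score: product of per-step correctness probabilities."""
    return prod(step_probs)


def best_of_n(
    candidates: Sequence[Sequence[float]],
    score: Callable[[Sequence[float]], float] = prm_score,
) -> int:
    """Return the index of the candidate with the highest score."""
    return max(range(len(candidates)), key=lambda j: score(candidates[j]))


# Mock per-step probabilities for N = 3 candidate solutions (assumed values).
candidates = [
    [0.9, 0.8, 0.3, 0.9],  # one doubtful step pulls the product down
    [0.9, 0.9, 0.9],       # every step looks sound
    [0.6, 0.6, 0.6, 0.6],  # uniformly shaky
]

best = best_of_n(candidates)
print("selected candidate:", best, "score:", round(prm_score(candidates[best]), 4))
```

Passing `score=min` instead of the default swaps the product for the minimum step score, the other aggregation listed in the reading order below. On these mock numbers both rules select the same candidate.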

The key observation: outcome labels are a noisy signal of reasoning quality. A wrong answer can result from mostly correct reasoning (one unlucky algebra error at the end). A right answer can result from completely invalid reasoning (a lucky coincidence). A PRM measures the quality of the reasoning more directly, because it judges each step.
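
To make the contrast concrete, here is a small sketch using two mock solutions: a lucky guess that lands on the right answer and a sound solution that slips on its last step. Everything in it is assumed for illustration: the step probabilities, the helper names `outcome_label` and `first_weak_step`, and the 0.5 threshold used to flag a step as doubtful.

```python
from math import prod
from typing import Optional, Sequence

# Hypothetical mock data: per-step probabilities a PRM might assign,
# plus whether the final answer happens to be correct.
solutions = {
    "lucky_guess": {"step_probs": [0.2, 0.3, 0.25], "final_correct": True},
    "late_slip": {"step_probs": [0.95, 0.9, 0.95, 0.1], "final_correct": False},
}


def outcome_label(solution: dict) -> int:
    """Outcome supervision sees only this: 1 if the final answer is right."""
    return 1 if solution["final_correct"] else 0


def first_weak_step(step_probs: Sequence[float], threshold: float = 0.5) -> Optional[int]:
    """1-based index of the first step scored below the (assumed) threshold."""
    for i, p in enumerate(step_probs, start=1):
        if p < threshold:
            return i
    return None


for name, sol in solutions.items():
    probs = sol["step_probs"]
    print(
        name,
        "| outcome label:", outcome_label(sol),
        "| step product:", round(prod(probs), 4),
        "| first weak step:", first_weak_step(probs),
    )
```

The outcome label rates the lucky guess as perfect and says nothing about where the other solution went wrong. The step-level scores flag the guess as unreliable from its first step and locate the slip at step 4.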

The Indian Analogy

For ORM: Your school entrance exam is graded on one thing only: the final answer to each question. A student who writes down “42 is the answer” and happens to be correct gets full marks, even though their test paper shows they didn’t understand anything. Another student who understood the entire solution method but made one arithmetic error at the very end gets nothing. The exam doesn’t reward thinking, only results.

For PRM: Now imagine a teacher who goes through your working line by line. “Step 1: correct. Step 2: correct. Step 3: wrong — you made a sign error. Step 4: would have been correct if step 3 were right.” The teacher understands your reasoning process. A lucky guess gets no credit. Good reasoning gets credit even if one small step derailed you.

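Why this matters in India: Objective entrance tests such as JEE Main and NEET are scored on the final answer alone, which is the ORM style of grading. CBSE board exams use step marking, where correct working earns partial credit even if the final answer is wrong, which is the PRM style. The same contrast applies to LLMs, which generate candidate solutions in batches: a selector that looks only at the answer can be fooled by a lucky guess, while a selector that checks the steps rewards sound reasoning.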

Read in This Order

Section	What You Will Learn	Difficulty	Time
01 - Context	Why RLHF reward models are hard to train; the failure of outcome supervision	Beginner	5 min
02 - The Problem	Concrete example: how ORM picks wrong solutions; data-efficiency issues	Intermediate	5 min
03 - The Idea	What is process supervision; the step-level annotation scheme; intuition	Intermediate	8 min
04 - The Math	Mathematical formulation of ORM vs. PRM; product and minimum scores	Intermediate	8 min
05 - Worked Example	Trace a 4-step solution; compute per-step scores, ORM, PRM, min scores	Intermediate	6 min
06 - The Code	Implement best-of-N selection; compare ORM and PRM on mock solutions	Beginner	5 min
07 - Limitations	Data cost, domain specificity, credit assignment, step definition	Advanced	4 min
08 - Impact	How this influenced OpenAI o1, AlphaProof, test-time compute	Intermediate	3 min
09 - Summary	One-line recap, main ideas, key numbers, what comes next	Beginner	1 min

Before You Read: Math and AI Concepts You’ll Need

Rewards and Scoring: Cross-Entropy Loss — foundation for understanding how we assign feedback to model outputs
Chain-of-Thought (Paper 14): The reasoning steps come from LLMs using CoT; PRMs evaluate these steps
RLHF (Paper 15): Reward models are the core component of RLHF; this paper improves them
Probability: Basic probability notation and conditional probability P(A|B)

Visual Overview

                    Math Problem
                         |
           ______________|______________
          |                             |
      GPT-4                         Human Verifiers
       |                              |
    Generate N                    Annotate each step
    solutions                     in each solution
       |                              |
       |______________________________|
                     |
            PRM800K Dataset
        (800K step annotations)
                     |
          Train Reward Model
                |        |
              ORM        PRM
              (1)       (p1, p2, ..., pT)
                |        |
          Best-of-N Selection
                |
         Best Solution
                |
          Final Answer
