Paper 16
Intermediate

Let's Verify Step by Step: A Process Supervision Approach to Reward Modeling

What This Paper Did

Imagine you hire a teacher to grade your maths homework. The old way: the teacher looks only at your final answer. If you wrote “42” and it’s correct, full marks — even if all your steps were nonsense. The new way: the teacher checks every step. They mark step 1, step 2, step 3, step 4 independently. If step 3 is wrong, they spot it immediately.

That’s the core insight of “Let’s Verify Step by Step.” OpenAI researchers trained two types of reward models for mathematical reasoning problems:

  1. Outcome Reward Models (ORM): Judge only the final answer. Trained on a binary signal: 1 if the final answer is correct, 0 if it is wrong.
  2. Process Reward Models (PRM): Judge each intermediate reasoning step. Score each step independently, then combine.

The paper’s main finding: process supervision produces a significantly more reliable reward model than outcome supervision when the model is used to select the best solution from many candidates. This matters in best-of-N sampling, where an LLM generates N candidate solutions and the reward model picks the one to return. If the selector is unreliable, so is the output.

To prove this, the researchers:

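  • Created PRM800K: a dataset of about 800,000 human step-level labels on solutions to competition math problems (from the MATH dataset), generated by a GPT-4-based model
  • Trained a PRM to predict the correctness of each step (annotators labeled each step as positive, negative, or neutral)
  • Compared PRM performance to ORM on a held-out set of 500 problems drawn from the MATH test set
  • Showed that PRM + best-of-N selection significantly outperforms ORM + best-of-N, with the PRM-selected solutions solving about 78% of the test problems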
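
The difference between the two supervision signals in the list above can be pictured with a small data structure. This is an illustrative sketch only: the `StepAnnotation` and `AnnotatedSolution` classes and their field names are hypothetical and do not mirror the actual PRM800K file format, and the step labels are simplified to binary values (1 = correct, 0 = wrong).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StepAnnotation:
    text: str   # one reasoning step written by the generator
    label: int  # human label, simplified here to 1 = correct, 0 = wrong


@dataclass
class AnnotatedSolution:
    problem: str
    steps: List[StepAnnotation]
    final_answer_correct: bool  # checked against the reference answer


def outcome_target(solution: AnnotatedSolution) -> int:
    """Single training label an ORM sees: only the final answer matters."""
    return 1 if solution.final_answer_correct else 0


def process_targets(solution: AnnotatedSolution) -> List[int]:
    """Per-step training labels a PRM sees: one label for every step."""
    return [step.label for step in solution.steps]


# Hypothetical two-step solution with an arithmetic slip in the last step.
example = AnnotatedSolution(
    problem="Solve 2x + 6 = 10 for x.",
    steps=[
        StepAnnotation("Subtract 6 from both sides: 2x = 4.", 1),
        StepAnnotation("Divide both sides by 2: x = 3.", 0),
    ],
    final_answer_correct=False,
)

print(outcome_target(example))   # 0
print(process_targets(example))  # [1, 0]
```

The outcome target collapses the whole solution into one bit, while the process targets keep the information that the first step was sound and only the second went wrong.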

Key equations:

ORM score:   R_ORM(x, y) = P(final_answer(y) is correct | x, y)
             [trained on a binary label: 1 if the final answer is correct, else 0]

PRM score:   R_PRM(x, y) = ∏_{i=1}^{T} p_i
             where p_i = P(step_i is correct | x, y_{1:i})

Best-of-N:   argmax_{i∈1..N} R(x, y_i)
             [select the solution with highest reward]
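
The PRM score and best-of-N rule above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, the per-step probabilities are made-up numbers standing in for a trained PRM's outputs, and the minimum-based score is included only as an alternative way to combine step scores.

```python
import math
from typing import Callable, List, Sequence


def prm_product_score(step_probs: Sequence[float]) -> float:
    """Product of per-step correctness probabilities (R_PRM above)."""
    return math.prod(step_probs)


def prm_min_score(step_probs: Sequence[float]) -> float:
    """Alternative aggregation: the weakest step decides the score."""
    return min(step_probs)


def best_of_n(
    candidates: List[Sequence[float]],
    score: Callable[[Sequence[float]], float],
) -> int:
    """Return the index of the candidate with the highest reward."""
    return max(range(len(candidates)), key=lambda i: score(candidates[i]))


# Made-up per-step probabilities for three candidate solutions.
candidates = [
    [0.9, 0.9, 0.2, 0.9],  # one very doubtful step
    [0.8, 0.8, 0.8, 0.8],  # uniformly decent
    [0.95, 0.5, 0.95],     # short, with one shaky step
]

print(best_of_n(candidates, prm_product_score))  # 2
print(best_of_n(candidates, prm_min_score))      # 1
```

The two aggregations can disagree: the product favors the short third candidate here, while the minimum favors the candidate whose weakest step is strongest. Note also that a product tends to shrink as the number of steps grows.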

The key observation: ORM is noisy. A wrong answer can result from mostly correct reasoning (unlucky algebra error at the end). A right answer can result from completely invalid reasoning (lucky coincidence). PRM captures the quality of reasoning directly, because it judges each step.


The Indian Analogy

For ORM: Imagine an entrance exam that grades only your final answer. A student who writes “42” and happens to be right gets full marks, even though their working shows no understanding. Another student who understood the entire method but made one arithmetic error at the very end gets nothing. The exam rewards results, not thinking.

For PRM: Now imagine a teacher who goes through your working line by line. “Step 1: correct. Step 2: correct. Step 3: wrong — you made a sign error. Step 4: would have been correct if step 3 were right.” The teacher understands your reasoning process. A lucky guess gets no credit. Good reasoning gets credit even if one small step derailed you.
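
The teacher's line-by-line marking corresponds to finding the first step that a PRM scores as likely wrong. A minimal sketch, assuming made-up step probabilities and an arbitrary 0.5 cutoff (both are illustrative assumptions, and `first_flagged_step` is a hypothetical helper):

```python
from typing import List, Optional


def first_flagged_step(
    step_probs: List[float], threshold: float = 0.5
) -> Optional[int]:
    """Return the 1-based index of the first step scored below threshold.

    Returns None when every step is at or above the threshold.
    """
    for index, prob in enumerate(step_probs, start=1):
        if prob < threshold:
            return index
    return None


# Steps 1, 2 and 4 look fine; step 3 contains the sign error.
step_probs = [0.97, 0.94, 0.08, 0.91]

print(first_flagged_step(step_probs))  # 3
print(first_flagged_step([0.9, 0.8]))  # None
```

An outcome-only grader can report that the solution failed, but not where; the per-step scores point directly at step 3.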

Why this matters in India: Objective entrance exams such as JEE Main and NEET grade only the final answer, while CBSE board exams use step marking, which gives partial credit for correct working even when the final answer is wrong. Step marking rewards the process. The same choice arises when an LLM generates a batch of candidate solutions: a selector that rewards only the answer can be fooled by lucky guesses, while one that rewards the steps favors sound reasoning.


Read in This Order

Section | What You Will Learn | Difficulty | Time
01 - Context | Why RLHF reward models are hard to train; the failure of outcome supervision | Beginner | 5 min
02 - The Problem | Concrete example: how ORM picks wrong solutions; data-efficiency issues | Intermediate | 5 min
03 - The Idea | What is process supervision; the step-level annotation scheme; intuition | Intermediate | 8 min
04 - The Math | Mathematical formulation of ORM vs. PRM; product and minimum scores | Intermediate | 8 min
05 - Worked Example | Trace a 4-step solution; compute per-step scores, ORM, PRM, min scores | Intermediate | 6 min
06 - The Code | Implement best-of-N selection; compare ORM and PRM on mock solutions | Beginner | 5 min
07 - Limitations | Data cost, domain specificity, credit assignment, step definition | Advanced | 4 min
08 - Impact | How this influenced OpenAI o1, AlphaProof, test-time compute | Intermediate | 3 min
09 - Summary | One-line recap, main ideas, key numbers, what comes next | Beginner | 1 min

Before You Read: Math and AI Concepts You’ll Need

  • Rewards and Scoring: Cross-Entropy Loss — foundation for understanding how we assign feedback to model outputs
  • Chain-of-Thought (Paper 14): The reasoning steps come from LLMs using CoT; PRMs evaluate these steps
  • RLHF (Paper 15): Reward models are the core component of RLHF; this paper improves them
  • Probability: Basic probability notation and conditional probability P(A|B)

Visual Overview

                    Math Problem
                         |
           ______________|______________
          |                             |
      GPT-4                         Human Verifiers
       |                              |
    Generate N                    Annotate each step
    solutions                     in each solution
       |                              |
       |______________________________|
                     |
            PRM800K Dataset
        (800K step annotations)
                     |
          Train Reward Model
                |        |
              ORM        PRM
              (1)       (p1, p2, ..., pT)
                |        |
          Best-of-N Selection
                |
         Best Solution
                |
          Final Answer

