Section 08

Impact: How This Changed the Field

Let's Verify Step by Step: A Process Supervision Approach to Reward Modeling (2023)

This paper transformed how researchers think about training AI models to solve hard problems. Here’s what changed:


1. OpenAI o1 and Process Supervision at Scale

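Likely influence: OpenAI’s o1 model (announced in September 2024) is trained with large-scale reinforcement learning to reason through a chain of thought before answering. OpenAI has not published the training details, so the link to this paper is an inference. A PRM-like approach (reward each step of reasoning, not just the final answer) is consistent with what has been made public, but it is not confirmed.

The result: o1 performs strongly on complex math and reasoning tasks by learning to slow down, think step-by-step, and get intermediate steps right.

Key evidence: On the MATH benchmark, OpenAI reports o1 scoring in the mid-90s (percent), far above GPT-4o in the same report. This shows the value of trained step-by-step reasoning, but it does not by itself isolate process supervision as the cause.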


2. Influenced Mathematical Reasoning Systems

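AlphaProof (DeepMind, 2024): A system for proving mathematical theorems in the formal language Lean. It is trained with reinforcement learning, and candidate proofs are verified formally by Lean. DeepMind has not described it as using a PRM, so the connection to this paper is thematic (checking reasoning as it is built, not only at the end) rather than documented.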

Comparison:

  • Outcome-only approach: Train on entire proofs; reward only if the proof is complete and correct
  • Step-level approach (in the spirit of this paper): Train on proof steps; reward individual inferences; detect errors early
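
The contrast between the two approaches can be sketched in a few lines. This is a minimal illustration, not code from the paper or from AlphaProof: the function names and the per-step labels are hypothetical, and in practice the labels would come from human annotators or a trained PRM.

```python
from typing import List


def outcome_reward(final_answer: str, reference_answer: str) -> float:
    """Outcome supervision: a single reward for the whole solution."""
    return 1.0 if final_answer.strip() == reference_answer.strip() else 0.0


def process_rewards(step_is_correct: List[bool]) -> List[float]:
    """Process supervision: one reward per reasoning step."""
    return [1.0 if ok else 0.0 for ok in step_is_correct]


# Two hypothetical solutions to the same problem; both end in a wrong answer.
# The step labels below are made up for illustration.
early_mistake = [True, False, True, True]  # the only bad inference is step 2
late_mistake = [True, True, True, False]   # the only bad inference is step 4

# Outcome supervision gives both solutions the same reward (0.0)...
same_outcome = outcome_reward("7", "3") == outcome_reward("9", "3")

# ...while process supervision records where each one went wrong.
early = process_rewards(early_mistake)
late = process_rewards(late_mistake)
```

Both solutions receive the same outcome reward, so an outcome-only signal cannot say which step to fix; the per-step rewards differ exactly at the faulty step.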

Result: Together with AlphaGeometry 2, AlphaProof solved four of the six problems from the 2024 International Mathematical Olympiad (IMO), reaching silver-medal level, a first for an AI system.


3. Established Best-of-N as the Standard Inference Method

Before this paper: Best-of-N selection existed, but it was mostly used with outcome reward models (ORMs), which score only the final answer.

After this paper: Best-of-N with a PRM became a common evaluation setup for math reasoning tasks. Researchers now routinely:

  1. Generate N candidate solutions (N=1, 4, 8, 16, …)
  2. Score each with a reward model
  3. Return the top-1 solution
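
The three steps above can be sketched as follows. The paper scores a solution as the probability that every step is correct, computed as the product of the PRM's per-step probabilities; the sketch follows that rule. Everything else is an assumption for illustration: `toy_step_scorer` is a hypothetical stand-in for a trained PRM, and the candidates are hand-written rather than sampled from a generator.

```python
import math
from typing import Callable, List, Sequence

Solution = List[str]  # a solution is an ordered list of reasoning steps


def solution_score(step_probs: Sequence[float]) -> float:
    """Probability that every step is correct: product of per-step probabilities."""
    return math.prod(step_probs)


def best_of_n(
    candidates: Sequence[Solution],
    step_scorer: Callable[[Solution], Sequence[float]],
) -> Solution:
    """Return the candidate whose PRM-style solution score is highest."""
    return max(candidates, key=lambda sol: solution_score(step_scorer(sol)))


def toy_step_scorer(solution: Solution) -> List[float]:
    # Hypothetical stand-in for a PRM: steps marked "??" are treated as doubtful.
    return [0.2 if "??" in step else 0.9 for step in solution]


candidates = [
    ["x + 2 = 5", "x = 5 + 2 ??", "x = 7"],
    ["x + 2 = 5", "x = 5 - 2", "x = 3"],
]
best = best_of_n(candidates, toy_step_scorer)
```

Because the score is a product, a single low-probability step pulls the whole solution down, which is what lets the ranking penalize a solution with one bad inference even when its other steps look fine.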

Impact: This decouples problem-solving (generation) from solution quality assessment (ranking). You can train a smaller model that’s good at generating candidates and a larger model that’s good at evaluating them.


4. Motivated the “Test-Time Compute” Paradigm

The insight: If you can generate multiple candidates and pick the best, you can trade compute time for accuracy at inference.

The recipe: Instead of training a larger model, train a smaller model and allocate more compute at test time (generate more candidates, evaluate more carefully).
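
A back-of-the-envelope calculation shows why this trade can pay off. Assume, purely for illustration, that each sampled solution is correct independently with probability p and that the verifier always picks a correct solution when one exists. Then best-of-N accuracy is 1 - (1 - p)^N. This is an idealized upper bound, not a result from the paper: real samples are correlated and real reward models make mistakes, so actual gains are smaller.

```python
def oracle_best_of_n_accuracy(p: float, n: int) -> float:
    """Upper bound on best-of-N accuracy with independent samples and a perfect verifier."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    if n < 1:
        raise ValueError("n must be at least 1")
    # The selection fails only if all n samples are wrong.
    return 1.0 - (1.0 - p) ** n


for n in (1, 4, 8, 16):
    print(n, round(oracle_best_of_n_accuracy(0.3, n), 3))
```

With p = 0.3, the bound rises from 30% at N=1 to roughly 76%, 94%, and 99.7% at N=4, 8, and 16. The gap between this bound and measured accuracy is one way to see how much a better verifier is worth.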

Modern examples:

  • Step-level search: Propose several candidate next steps, score them with a PRM, and expand only the best ones (beam or tree search)
  • Sampling and reranking: Run multiple inference passes, select via a reward model or a majority vote
  • Multi-step RL training: Use process-level rewards during RL to refine intermediate outputs (a training-time use of the same signal)

Connection to Paper 23 (Scaling Test-Time Compute): That paper builds directly on this insight, using PRM-guided search as one way to scale test-time compute efficiently.


5. Enabled the PRM800K Dataset to Become a Research Resource

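Open-source release: OpenAI released the PRM800K dataset, about 800,000 step-level human feedback labels on model-generated solutions to MATH problems, for research use (see the repository for license terms).

Impact: Researchers have used it to: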

  • Train PRMs for their own applications
  • Study step-level feedback in other domains
  • Develop new training algorithms that leverage per-step rewards

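Derivative work: Follow-up projects have trained PRMs on top of open-weight models such as the Llama family. Math-Shepherd, for example, replaces human step labels with automatically generated ones. The base models themselves (Code Llama, Llama 2 70B) were not released as PRM-trained systems.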


6. Shifted Industry Focus from Outcome to Process

Before: The dominant paradigm in RLHF was outcome-based: a reward model scores the complete response, and that single score steers RL. This worked, but the signal was sparse and said nothing about which step went wrong.

After: The community increasingly explores process-based rewards:

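  • Automated step labeling replaces costly human annotation with model-generated step labels
  • Reasoning-focused models (like o1) are trained to produce long step-by-step reasoning, although their exact reward design is not public
  • New research on “intermediate reward modeling” explores step-level supervision

Broader implication: Process supervision is now widely seen as one lever for improving AI alignment and reasoning, since it rewards a human-endorsed chain of reasoning and not only a correct-looking result.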


7. Practical Applications in Industry

Verification tools: Systems for code verification or math tutoring can use PRM-style approaches to check intermediate steps.

Example: A math tutoring system can not only mark homework right or wrong (outcome feedback), but also point to the specific step where a student went wrong (process feedback).
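
A minimal sketch of such process feedback, assuming a step scorer already exists: the per-step scores below are hypothetical numbers standing in for PRM outputs, and the 0.5 threshold is an arbitrary choice for illustration.

```python
from typing import List, Optional


def first_suspect_step(step_scores: List[float], threshold: float = 0.5) -> Optional[int]:
    """Index of the first step scored below the threshold, or None if all pass."""
    for index, score in enumerate(step_scores):
        if score < threshold:
            return index
    return None


def feedback(steps: List[str], step_scores: List[float], threshold: float = 0.5) -> str:
    """Turn per-step scores into a short message for the student."""
    if len(steps) != len(step_scores):
        raise ValueError("need exactly one score per step")
    index = first_suspect_step(step_scores, threshold)
    if index is None:
        return "All steps look correct."
    # Report steps with 1-based numbering, as a student would count them.
    return f"Check step {index + 1}: {steps[index]}"


# Hypothetical student work on 2x = 6, with made-up scores.
message = feedback(["2x = 6", "x = 6 - 2", "x = 4"], [0.95, 0.10, 0.40])
```

Reporting only the first low-scoring step is a deliberate choice here: later steps often inherit the earlier mistake, so flagging them separately would mostly add noise.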

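Reasoning benchmarks: Older benchmarks such as ARC Challenge and MATH (Competition Math) predate this paper and grade only the final answer. Newer benchmarks such as ProcessBench add process-level evaluation, asking a model to locate the first incorrect step, because that better captures reasoning quality.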


8. Opened Questions for Future Research

Follow-up challenges:

  1. Step definition: How to automatically extract steps in complex reasoning?
  2. Generalization: Does process supervision work outside math (coding, medicine, law)?
  3. Scalability: Can we annotate millions of steps efficiently?
  4. Interpretability: Can we use step-level rewards to understand what the model is thinking?
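
For question 1, the simplest baseline is to treat every non-empty line as one step. The sketch below assumes solutions are written one step per line; the function name is hypothetical. It also shows why the question is open: free-form reasoning rarely comes with such clean delimiters.

```python
from typing import List


def split_into_steps(solution_text: str) -> List[str]:
    """Naive step extraction: each non-empty line is one step."""
    return [line.strip() for line in solution_text.splitlines() if line.strip()]


steps = split_into_steps("a = 1\n\n  b = a + 1  \nb = 2\n")
```

A line-based splitter breaks down as soon as one line holds two inferences or one inference spans several lines, which is where learned or structure-aware step segmentation comes in.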

These questions drive ongoing research.


Timeline: This Paper’s Influence

2023 (May): Paper published (arXiv)
2023-2024: Community experiments with PRM800K
2024 (January): Paper accepted at ICLR 2024
2024 (July): DeepMind announces AlphaProof (reinforcement learning with formal proof checking in Lean)
2024 (September): OpenAI announces o1, trained to reason step by step (use of process supervision not publicly confirmed)
2024-2025: Widespread interest in step-level rewards for reasoning-focused models

Summary

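This paper didn’t introduce process supervision (earlier work, such as Uesato et al., 2022, had already compared process-based and outcome-based feedback), but it:

  • Showed with systematic experiments that process supervision beats outcome supervision at scale, with its PRM solving 78% of problems from a representative subset of the MATH test set
  • Released data (PRM800K) enabling further research
  • Encouraged industry labs and academic groups to explore process-based training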
  • Shifted the paradigm from “reward the final answer” to “reward good intermediate reasoning”

For a paper that could have been just an incremental improvement, it had outsized influence on the field’s direction.