Impact: How This Changed the Field — Let's Verify Step by Step: A Process Supervision Approach to Reward Modeling

This paper transformed how researchers think about training AI models to solve hard problems. Here’s what changed:

1. OpenAI o1 and Process Supervision at Scale

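Likely influence: OpenAI's o1 model (announced in September 2024) is trained with large-scale reinforcement learning to reason step by step before answering. OpenAI has not published the training recipe, so whether o1 uses a PRM is unconfirmed. The link to this paper is an inference from the shared idea: get each step of reasoning right, not just the final answer.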

The result: o1 performs competitively with human experts on several math and reasoning benchmarks by spending more time thinking step by step before it answers.

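Key evidence: On the MATH benchmark, OpenAI reports o1 scoring about 95%, far above GPT-4o's roughly 60%. This shows the payoff of step-by-step reasoning, but it does not by itself isolate the effect of process supervision.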

2. Influenced Mathematical Reasoning Systems

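AlphaProof (DeepMind, 2024): A system for proving mathematical theorems in the formal language Lean. It is trained with AlphaZero-style reinforcement learning, and every inference is validated by the Lean proof checker rather than by a learned PRM. The connection to this paper is thematic (checking intermediate steps, not just the final result), not a documented architectural borrowing.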

Comparison:

Old approach: Train on entire proofs; reward only if the proof is complete and correct
New approach (in the spirit of this paper): Train on proof steps; reward individual inferences; detect errors early
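
The contrast above can be made concrete with a toy sketch. It is illustrative only: the function names and the boolean step labels are assumptions, not taken from the paper or from AlphaProof, and it treats a solution as correct only when every step is valid (as in a formal proof).

```python
from typing import List

def outcome_rewards(step_ok: List[bool]) -> List[float]:
    """Outcome supervision: a single signal, attached to the last step only."""
    final = 1.0 if all(step_ok) else 0.0
    return [0.0] * (len(step_ok) - 1) + [final]

def process_rewards(step_ok: List[bool]) -> List[float]:
    """Process supervision: one signal per step."""
    return [1.0 if ok else 0.0 for ok in step_ok]

def first_error(step_ok: List[bool]) -> int:
    """Index of the first invalid step, or -1 if every step is valid."""
    for i, ok in enumerate(step_ok):
        if not ok:
            return i
    return -1

# Hypothetical 4-step attempt whose third inference is invalid.
steps = [True, True, False, True]
print(outcome_rewards(steps))  # [0.0, 0.0, 0.0, 0.0]
print(process_rewards(steps))  # [1.0, 1.0, 0.0, 1.0]
print(first_error(steps))      # 2
```

The outcome signal only says that something went wrong; the per-step signal says where, which is what makes early error detection possible.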

Result: Together with AlphaGeometry 2, AlphaProof solved four of the six problems at the 2024 International Mathematical Olympiad (IMO), the first time an AI system reached silver-medal standard.

3. Established Best-of-N as the Standard Inference Method

Before this paper: Best-of-N selection existed, but it was mostly used with crude reward models (like outcome scoring on final answers).

After this paper: Best-of-N with a PRM became the standard evaluation pipeline for math reasoning tasks. Researchers now routinely:

Generate N candidate solutions (N=1, 4, 8, 16, …)
Score each with a reward model
Return the top-1 solution

Impact: This decouples problem-solving (generation) from solution quality assessment (ranking). In principle the generator and the evaluator can then be sized and trained separately, for example a smaller generator paired with a stronger verifier.
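
The generate, score, select pipeline in this section can be sketched in a few lines. This is a minimal sketch, not the authors' code: `toy_prm` is a hypothetical stand-in for a trained PRM, and the candidates are hard-coded instead of sampled from a generator. The solution score follows the paper's definition, the product of the per-step correctness probabilities.

```python
import math
from typing import Callable, List, Sequence

Solution = List[str]  # a solution is a list of reasoning steps

def solution_score(step_probs: Sequence[float]) -> float:
    """Probability that every step is correct: product of per-step scores."""
    return math.prod(step_probs)

def best_of_n(candidates: Sequence[Solution],
              score_steps: Callable[[Solution], Sequence[float]]) -> Solution:
    """Return the candidate with the highest solution-level score."""
    return max(candidates, key=lambda sol: solution_score(score_steps(sol)))

def toy_prm(sol: Solution) -> List[float]:
    """Hypothetical scorer: a real PRM would be a trained model."""
    return [0.2 if "guess" in step else 0.9 for step in sol]

candidates = [
    ["x + 2 = 5", "guess x = 4"],
    ["x + 2 = 5", "x = 5 - 2", "x = 3"],
]
best = best_of_n(candidates, toy_prm)
print(best)  # ['x + 2 = 5', 'x = 5 - 2', 'x = 3']
```

Because the score is a product, one low-confidence step drags down the whole solution, so a longer but fully justified derivation can outrank a shorter one that contains a shaky step.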

4. Motivated the “Test-Time Compute” Paradigm

The insight: If you can generate multiple candidates and pick the best, you can trade compute time for accuracy at inference.

The formula: Instead of training a larger model, train a smaller model and allocate more compute at test time (generate more candidates, evaluate more carefully).
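
A back-of-the-envelope calculation shows why this trade can pay off. The sketch below assumes that samples are independent, that each is correct with probability `p`, and that the verifier always recognizes a correct solution. All three are simplifying assumptions, so the numbers are an upper bound rather than a prediction for any real PRM.

```python
def best_of_n_accuracy(p: float, n: int) -> float:
    """Chance that at least one of n independent samples is correct.

    With a perfect verifier this equals best-of-n accuracy; an imperfect
    reward model will achieve less.
    """
    return 1.0 - (1.0 - p) ** n

# A generator that is right 30% of the time, sampled 1 to 64 times.
for n in (1, 4, 16, 64):
    print(n, round(best_of_n_accuracy(0.3, n), 3))
```

Under these assumptions, four samples already lift a 30% generator to about 76%. In practice the gains flatten sooner because verifier mistakes accumulate as N grows.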

Modern examples:

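Step-level search: Sample several candidate next steps, score them with a PRM, and expand only the best ones (speculative decoding, by contrast, verifies draft tokens against the target model to cut latency and does not involve a PRM)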
Ensemble methods: Run multiple inference passes, select via a reward model
Multi-step RL training: Use process-level rewards during RL to refine intermediate outputs

Connection to Paper 23 (Scaling Test-Time Compute): That paper directly builds on this insight: process supervision enables efficient test-time scaling.

5. Enabled the PRM800K Dataset to Become a Research Resource

Open-source release: OpenAI released the PRM800K dataset, about 800,000 step-level human labels on model-generated solutions to MATH problems, on GitHub for research use.

Impact: Many research groups have used it to:

Train PRMs for their own applications
Study step-level feedback in other domains
Develop new training algorithms that leverage per-step rewards

Derivative work: Follow-up projects have trained PRMs and PRM-guided policies on top of open-weight models, including models in the Llama family.

6. Shifted Industry Focus from Outcome to Process

Before: The dominant paradigm in RLHF was outcome-based (use outcome models to steer RL). This worked, but was noisy and sample-inefficient.

After: The community increasingly explores process-based rewards:

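Constitutional AI (which predates this paper) supervises models with AI-written critiques and revisions; it is related in spirit but does not use step-level reasoning rewards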
Reasoning-focused models (like o1) use process rewards
New research on “intermediate reward modeling” explores step-level supervision

Broader implication: Process supervision is now seen as a key lever for improving AI alignment and reasoning.

7. Practical Applications in Industry

Verification tools: Systems for code verification or math tutoring can use PRM-style approaches to check intermediate steps.

Example: A math tutoring system can now not only mark homework right or wrong (outcome), but also provide feedback on specific steps where students went wrong (process feedback).

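Reasoning benchmarks: Established benchmarks such as ARC Challenge and MATH (competition math) score only final answers. Newer evaluations increasingly add step-level checks, such as locating the first incorrect step, because they better capture reasoning quality.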

8. Opened Questions for Future Research

Follow-up challenges:

Step definition: How to automatically extract steps in complex reasoning?
Generalization: Does process supervision work outside math (coding, medicine, law)?
Scalability: Can we annotate millions of steps efficiently?
Interpretability: Can we use step-level rewards to understand what the model is thinking?

These questions drive ongoing research.

Timeline: This Paper’s Influence

2023 (May): Paper published (arXiv)
2024 (January): Paper accepted at ICLR 2024
2023-2024: Community experiments with PRM800K
2024 (July): DeepMind announces AlphaProof (formal proof search in which every step is machine-checked)
2024 (September): OpenAI announces o1, a reasoning-focused model in the same research lineage
2024-2025: Widespread adoption in reasoning-focused models

Summary

This paper didn’t introduce process supervision (it was discussed earlier), but it:

Showed with systematic experiments that process supervision outperforms outcome supervision on MATH at scale
Released data (PRM800K) enabling further research
Helped push major labs (OpenAI, DeepMind, Meta) toward step-level verification and reasoning-focused training
Shifted the paradigm from “reward the final answer” to “reward good intermediate reasoning”

For a paper that could have been just an incremental improvement, it had outsized influence on the field’s direction.