This paper transformed how researchers think about training AI models to solve hard problems. Here’s what changed:
1. OpenAI o1 and Process Supervision at Scale
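Likely influence: OpenAI’s o1 model (announced in September 2024) is trained with large-scale reinforcement learning to reason step by step before answering. OpenAI has not published the training recipe, so whether o1 uses a PRM is unconfirmed. The link to this paper is plausible, but it is an inference rather than a documented fact.
The result: o1 handles complex math and reasoning tasks far better than its predecessors by learning to slow down, think step-by-step, and get intermediate steps right.
Key evidence: OpenAI reports o1 scoring in the mid-90s (percent) on the MATH benchmark, well above earlier models. Within the paper itself, best-of-N selection with the PRM solved 78.2% of a representative MATH test subset, compared with 72.4% for the outcome-supervised reward model and 69.6% for majority voting.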
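To make the outcome-versus-process distinction concrete, here is a minimal sketch. It follows the paper's description of scoring a solution as the product of per-step correctness probabilities. The function names and the probability values are hypothetical, chosen only for illustration; a real PRM would produce the per-step probabilities.

```python
from math import prod
from typing import Sequence


def outcome_reward(final_answer: str, reference_answer: str) -> float:
    """Outcome supervision: one signal, based only on the final answer."""
    return 1.0 if final_answer.strip() == reference_answer.strip() else 0.0


def process_score(step_probs: Sequence[float]) -> float:
    """Process supervision: combine per-step correctness probabilities.

    The product is read as the probability that every step is correct.
    """
    return prod(step_probs)


# Hypothetical PRM outputs for two solutions that reach the same final answer.
sound_solution = [0.98, 0.95, 0.97]
lucky_solution = [0.98, 0.20, 0.97]  # second step is probably wrong

sound_score = process_score(sound_solution)
lucky_score = process_score(lucky_solution)
```

Both solutions would receive the same outcome reward if their final answers match the reference, but the process score separates them because one contains a doubtful step.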
2. Influenced Mathematical Reasoning Systems
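AlphaProof (DeepMind, 2024): A system for proving mathematical theorems in the Lean formal language, trained with AlphaZero-style reinforcement learning. The proof checker validates individual proof steps, so feedback is available before a proof is finished. DeepMind has not described this as a learned PRM, so the connection to this paper is thematic rather than documented.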
Comparison:
- Old approach: Train on entire proofs; reward only if the proof is complete and correct
- New approach (in the spirit of this paper): Train on proof steps; reward individual inferences; detect errors early
Result: Together with AlphaGeometry 2, AlphaProof reached silver-medal standard at the 2024 International Mathematical Olympiad (IMO), solving four of the six problems.
3. Established Best-of-N as the Standard Inference Method
Before this paper: Best-of-N selection existed, but it was mostly paired with outcome reward models that score only the final answer.
After this paper: Best-of-N with a PRM became a common evaluation pipeline for math reasoning tasks. Researchers now routinely:
- Generate N candidate solutions (N=1, 4, 8, 16, …)
- Score each with a reward model
- Return the top-1 solution
Impact: This decouples problem-solving (generation) from solution quality assessment (ranking). You can train a smaller model that’s good at generating candidates and a larger model that’s good at evaluating them.
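The generate, score, select loop above can be sketched in a few lines. This is a minimal illustration, not the paper's code: `best_of_n`, `toy_generate`, and `toy_score` are hypothetical names, and the toy generator and scorer stand in for a real language model and reward model.

```python
import itertools
from typing import Callable, List, Tuple


def best_of_n(
    problem: str,
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
    n: int,
) -> Tuple[str, float]:
    """Sample n candidate solutions and return the highest-scoring one."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates: List[str] = [generate(problem) for _ in range(n)]
    scored = [(score(problem, c), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


# Toy stand-ins (assumptions): a generator that cycles through canned
# answers and a scorer that prefers the correct one.
_canned = itertools.cycle(["x = 3", "x = 5", "x = 4"])


def toy_generate(problem: str) -> str:
    return next(_canned)


def toy_score(problem: str, solution: str) -> float:
    return 1.0 if solution == "x = 4" else 0.1


answer, answer_score = best_of_n("Solve 2x = 8", toy_generate, toy_score, n=3)
```

Because the generator and the scorer are passed in as separate callables, either one can be swapped out independently, which is the decoupling described above.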
4. Motivated the “Test-Time Compute” Paradigm
The insight: If you can generate multiple candidates and pick the best, you can trade compute time for accuracy at inference.
The trade-off: Instead of training a larger model, train a smaller model and allocate more compute at test time (generate more candidates, evaluate more carefully).
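A back-of-the-envelope calculation shows why sampling more candidates can help. The sketch below makes two simplifying assumptions: each sample is independently correct with probability p, and the verifier always picks a correct sample when one exists. Real reward models are imperfect, so this is an upper bound on best-of-N accuracy, not a prediction. The function name `coverage` is hypothetical.

```python
def coverage(p: float, n: int) -> float:
    """Probability that at least one of n independent samples is correct."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 - (1.0 - p) ** n


# How the upper bound grows with the sampling budget for p = 0.2.
for n in (1, 4, 16, 64):
    print(n, round(coverage(0.2, n), 3))
```

The gap between this bound and the accuracy actually achieved is where the quality of the reward model matters.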
Modern examples:
- Step-level search: Expand partial solutions one step at a time and prune with PRM scores (for example, beam search)
- Ensemble methods: Run multiple inference passes, select via a reward model or majority vote
- Multi-step RL training: Use process-level rewards during RL to refine intermediate outputs (a training-time counterpart rather than a test-time method)
Connection to Paper 23 (Scaling Test-Time Compute): That paper builds directly on this insight, using search against a PRM as one way to spend extra test-time compute.
5. Enabled the PRM800K Dataset to Become a Research Resource
Open-source release: OpenAI released the PRM800K dataset for research (subject to licensing).
Impact: Researchers have used it to:
- Train PRMs for their own applications
- Study step-level feedback in other domains
- Develop new training algorithms that leverage per-step rewards
Derivative work: Follow-up work has trained PRMs on top of open-weight models, including Llama-family models, sometimes using automatically generated step labels instead of human annotation.
6. Shifted Industry Focus from Outcome to Process
Before: The dominant paradigm in RLHF was outcome-based (use outcome models to steer RL). This worked, but was noisy and sample-inefficient.
After: The community increasingly explores process-based rewards:
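- Constitutional AI, which predates this paper, is a related but distinct idea: it uses AI feedback guided by written principles, not step-level labels
- Reasoning-focused models (like o1) are widely assumed to benefit from step-level signals, though their training recipes are not public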
- New research on “intermediate reward modeling” explores step-level supervision
Broader implication: Process supervision is now seen as a key lever for improving AI alignment and reasoning.
7. Practical Applications in Industry
Verification tools: Code verification and math tutoring systems are natural fits for PRM-style approaches, because both need to check intermediate steps and not only the final result.
Example: A math tutoring system can now not only mark homework right or wrong (outcome), but also provide feedback on specific steps where students went wrong (process feedback).
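The tutoring example can be sketched as a search for the first step whose estimated correctness falls below a threshold. This is a minimal illustration under stated assumptions: `first_flagged_step` is a hypothetical helper, the 0.5 threshold is arbitrary, and the probabilities are made-up values standing in for PRM outputs.

```python
from typing import Optional, Sequence


def first_flagged_step(
    step_probs: Sequence[float], threshold: float = 0.5
) -> Optional[int]:
    """Return the index of the first step scored below threshold, or None."""
    for index, prob in enumerate(step_probs):
        if prob < threshold:
            return index
    return None


# A student's solution to 2x + 6 = 14, with hypothetical PRM scores.
steps = ["2x + 6 = 14", "2x = 20", "x = 10"]
probs = [0.97, 0.12, 0.40]

flagged = first_flagged_step(probs)
if flagged is None:
    feedback = "All steps look fine."
else:
    feedback = f"Check step {flagged + 1}: {steps[flagged]}"
```

An outcome-only checker would report just that the final answer is wrong; the step-level version points the student to the line where the error first appears.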
Reasoning benchmarks: Established benchmarks such as ARC Challenge and MATH (competition math) predate this paper and score final answers only. Newer evaluations increasingly check step-level correctness because it better captures reasoning quality.
8. Opened Questions for Future Research
Follow-up challenges:
- Step definition: How to automatically extract steps in complex reasoning?
- Generalization: Does process supervision work outside math (coding, medicine, law)?
- Scalability: Can we annotate millions of steps efficiently?
- Interpretability: Can we use step-level rewards to understand what the model is thinking?
These questions drive ongoing research.
Timeline: This Paper’s Influence
2023 (May): Paper published (arXiv)
2024 (January): Paper accepted at ICLR 2024
2023-2024: Community experiments with PRM800K
2024 (July): DeepMind announces AlphaProof (reinforcement learning with formally checked proof steps)
2024 (September): OpenAI announces o1 (step-by-step reasoning trained with RL; use of a PRM not confirmed)
2024-2025: Widespread adoption in reasoning-focused models
Summary
This paper didn’t introduce process supervision (earlier work had already compared process- and outcome-based feedback), but it:
- Showed that it works at scale with systematic experiments
- Released data (PRM800K) enabling further research
- Encouraged work on process-based training at major labs and in the open-source community
- Shifted the paradigm from “reward the final answer” to “reward good intermediate reasoning”
For a paper that could have been just an incremental improvement, it had outsized influence on the field’s direction.