Context: The RLHF Bottleneck and the Rise of Reward Models — Let's Verify Step by Step: A Process Supervision Approach to Reward Modeling

By 2023, OpenAI and other labs had made Reinforcement Learning from Human Feedback (RLHF), the technique from Paper 15, a standard part of aligning models such as GPT-3.5 and GPT-4. The basic idea is simple: ask humans to rate or rank model outputs (“good response,” “bad response”), train a reward model on those judgments, then use the reward model to steer training via RL.

But there was a problem hiding in plain sight.

When humans score a model output, what are they actually scoring?

Typically, they score the outcome. “Does this answer solve the problem? Yes or no.” For a math problem, this means: “Is the final numerical answer correct?” For a summarization task: “Is the summary accurate and concise?” For code generation: “Does the code run?”

A reward model trained on such labels is called an outcome-supervised reward model (ORM), and it feels intuitive. But it has a critical weakness: the outcome alone tells you very little about the quality of the reasoning that produced it.

Consider a student solving a calculus problem. She writes:

Integrate 2x + 3 from 0 to 5.

Step 1: Antiderivative is x² + 3x
Step 2: Evaluate at bounds: (25 + 15) - (0) = 40
Answer: 40

The final answer is correct. An outcome reward model gives full marks.

Now consider another student who writes:

Integrate 2x + 3 from 0 to 5.

Step 1: Antiderivative is x² + 4x
Step 2: Evaluate at bounds: (25 + 20) - (0) = 40
Answer: 40

The antiderivative is wrong (differentiating x² + 4x gives 2x + 4, not 2x + 3), and Step 2 contains an arithmetic slip (25 + 20 is 45, not 40). The two errors happen to cancel, so the final answer is still correct. An ORM gives full marks here too.

Both students get “1” for outcome, but one actually knows calculus and one got lucky. An outcome reward model cannot tell the difference.
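The contrast can be made concrete with a minimal sketch. It assumes a hypothetical `Solution` record whose per-step labels come from a human annotator; the names `outcome_reward` and `process_reward`, and the all-steps-must-be-correct rule, are illustrative choices rather than the paper's implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Solution:
    # Hypothetical record: one human label per reasoning step.
    step_labels: List[bool]
    final_answer: float


def outcome_reward(solution: Solution, reference_answer: float) -> float:
    """Score only the final answer: 1.0 if it matches the reference."""
    return 1.0 if solution.final_answer == reference_answer else 0.0


def process_reward(solution: Solution) -> float:
    """Score the reasoning: 1.0 only if every step was marked correct."""
    return 1.0 if all(solution.step_labels) else 0.0


# Two made-up solutions that reach the same correct final answer.
careful = Solution(step_labels=[True, True], final_answer=40.0)
lucky = Solution(step_labels=[False, True], final_answer=40.0)  # flawed first step

for name, solution in [("careful", careful), ("lucky", lucky)]:
    print(name, outcome_reward(solution, 40.0), process_reward(solution))
```

Under these assumptions both solutions receive an outcome reward of 1.0, while only the solution with every step marked correct receives a process reward of 1.0.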

Why is this a problem for scaling? Because LLMs are often sampled many times on the same problem. If you ask GPT-4 to solve a problem, you might draw 10 different attempted solutions. You check them all, and one happens to reach the correct answer. An ORM tells you “yes, that one is right” and you use it. But if that solution got there through flawed steps, you’re rewarding luck, not reasoning.

The bigger issue: human annotation is expensive. If you want to scale RLHF, you need a large annotation effort. Asking annotators to score only the final answer seems efficient (one binary label per output). But you’re discarding information: the annotator reads the entire chain of reasoning, step 1 through step 4, and naturally forms a judgment about each step. An ORM throws that rich signal away and keeps only the final bit.

By early 2023, this limitation was becoming clear. Models like GPT-4 were being trained with RLHF, but the reward models were noisy and hard to improve. Something better was needed.

The key insight: Why not ask humans to annotate at the step level? Have them mark each intermediate reasoning step as correct or incorrect. This is more work, but it captures the actual reasoning quality. And if humans are going to read the entire solution anyway, asking them to mark steps is not much harder than asking them to judge the outcome.
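As a rough illustration of what step-level annotation adds, the sketch below compares the two kinds of label for a single solution. The labels are invented, and the helper `first_error_index` is a hypothetical name introduced here, not something taken from the paper.

```python
from typing import List, Optional


def first_error_index(step_labels: List[bool]) -> Optional[int]:
    """Return the 0-based index of the first step marked incorrect, or None."""
    for index, is_correct in enumerate(step_labels):
        if not is_correct:
            return index
    return None


# Invented annotation for one four-step solution.
step_labels = [True, True, False, True]  # process supervision: one label per step
outcome_label = False                    # outcome supervision: one label in total

print("labels from process supervision:", len(step_labels))
print("labels from outcome supervision:", 1)
print("first incorrect step (0-based):", first_error_index(step_labels))
```

The step-level labels provide more training signal per solution and also locate where the reasoning first goes wrong, which a single outcome label cannot do.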

This is the shift from outcome supervision to process supervision, and it is the central idea this paper investigates.