Let’s make the limitation concrete.
Example 1: Messy reasoning, right answer
Problem: Solve 2x + 3 = 11.
Solution A (generated by LLM):
Step 1: Subtract 3 from both sides: 2x = 8
Step 2: Divide by 2: x = 4
Step 3: Check: 2(4) + 3 = 8 + 3 = 11. Correct.
Final Answer: x = 4
Solution B (also generated by LLM):
Step 1: 2x + 3 = 11, so x = (11 - 3) / 2 = 4 (jumping straight to the rearranged formula, which is mathematically valid)
Step 2: But let me double-check by guessing: if x = 4, then 2(4) + 3 = 11. Yes!
Step 3: Actually, I realize I skipped steps. Let me verify by trying x = 3: 2(3) + 3 = 9, not 11.
Step 4: So x must be 4.
Final Answer: x = 4
Both solutions reach the final answer x = 4, which is correct. An outcome reward model (ORM) gives both a score of 1.
But a human reader sees:
- Solution A: Clear, direct, mathematically sound, efficient.
- Solution B: Rambling, inefficient, contains redundant checks, but arrives at the right answer.
If you want to train an LLM to generate clear, efficient reasoning, you want to reward Solution A and penalize Solution B. An ORM cannot do this.
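To see why the two solutions are indistinguishable to an ORM, here is a minimal sketch. It assumes the reward is an exact-match check against a known ground truth; the function name `outcome_reward` and the dictionary fields are hypothetical, chosen only for illustration.

```python
def outcome_reward(final_answer: float, ground_truth: float, tol: float = 1e-9) -> int:
    """Return 1 if the final answer matches the ground truth, else 0."""
    return int(abs(final_answer - ground_truth) <= tol)

# Hypothetical records for Solutions A and B. The reward function only
# ever reads "final_answer"; the step count and step quality are ignored.
solution_a = {"num_steps": 3, "final_answer": 4.0}
solution_b = {"num_steps": 4, "final_answer": 4.0}

reward_a = outcome_reward(solution_a["final_answer"], ground_truth=4.0)
reward_b = outcome_reward(solution_b["final_answer"], ground_truth=4.0)

print(reward_a, reward_b)  # 1 1
```

Nothing about the steps ever enters the function, so no change to the reasoning alone can change the score.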
Example 2: Mostly right reasoning, wrong answer
Problem: Integrate x² from 0 to 3.
Solution:
Step 1: The antiderivative of x² is x³/3. [CORRECT]
Step 2: Evaluate at x=3: (3)³/3 = 27/3 = 9. [CORRECT]
Step 3: Evaluate at x=0: 0. [CORRECT]
Step 4: The integral is 9 - 0 = 9. [CORRECT]
Final Answer: 9
This is entirely correct. But now suppose the LLM made one arithmetic error:
Step 1: The antiderivative of x² is x³/3. [CORRECT]
Step 2: Evaluate at x=3: (3)³/3 = 27/2 = 13.5. [WRONG — should be 27/3 = 9]
Step 3: Evaluate at x=0: 0. [CORRECT]
Step 4: The integral is 13.5 - 0 = 13.5. [WRONG: inherits the error from Step 2]
Final Answer: 13.5
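An outcome reward model gives this a score of 0 (wrong final answer), even though three of the four steps are sound in themselves: Step 4 subtracts correctly and is wrong only because it inherits the bad value from Step 2. The error is isolated to one arithmetic operation.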
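The contrast between the two kinds of signal for the faulty solution can be sketched as follows. The step labels are copied from the bracketed annotations in the example; the helper `first_error_step` is a hypothetical name, not something defined by the source.

```python
from typing import List, Optional

def first_error_step(step_labels: List[bool]) -> Optional[int]:
    """Return the 1-based number of the first incorrect step, or None if all pass."""
    for number, is_correct in enumerate(step_labels, start=1):
        if not is_correct:
            return number
    return None

# Labels for the faulty integral solution, as marked in the example:
# Step 1 correct, Step 2 wrong, Step 3 correct, Step 4 wrong (propagated).
step_labels = [True, False, True, False]
final_answer, ground_truth = 13.5, 9.0

outcome_signal = int(final_answer == ground_truth)  # one bit for the whole solution
first_error = first_error_step(step_labels)         # where the reasoning first breaks

print(outcome_signal)  # 0
print(first_error)     # 2
```

The outcome signal says only that something went wrong; the step labels say where.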
Why this matters: If you’re using best-of-N selection (generate 100 solutions, pick the one with the highest reward), and one solution has solid reasoning but makes a single arithmetic error, the ORM discards it entirely. The ORM can only see “that output is wrong; discard it.” It cannot see “that output is 75% correct in reasoning.”
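A minimal sketch of best-of-N selection under an outcome-only score follows. It assumes an exact-match scorer standing in for a learned ORM, and the candidate list (including the `wrong_method` entry and its answer) is invented purely for illustration.

```python
from typing import Callable, Dict, List

Candidate = Dict[str, object]

def best_of_n(candidates: List[Candidate], score_fn: Callable[[Candidate], float]) -> Candidate:
    """Return the candidate with the highest score."""
    return max(candidates, key=score_fn)

def exact_match_score(candidate: Candidate, ground_truth: float = 9.0) -> float:
    """Stand-in for an ORM: 1.0 if the final answer is right, else 0.0."""
    return 1.0 if candidate["final_answer"] == ground_truth else 0.0

# Hypothetical candidates for the integral problem.
candidates = [
    {"name": "one_arithmetic_slip", "final_answer": 13.5},
    {"name": "wrong_method", "final_answer": 27.0},
    {"name": "fully_correct", "final_answer": 9.0},
]

scores = [exact_match_score(c) for c in candidates]
best = best_of_n(candidates, exact_match_score)

print(scores)        # [0.0, 0.0, 1.0]
print(best["name"])  # fully_correct
```

The near-miss and the wholly wrong attempt receive the same score of 0.0, so the selector has no way to prefer one over the other.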
The data efficiency problem:
For RLHF to work well, you need many human annotations. But human time is limited. Suppose you have 100 solutions to annotate:
- Outcome supervision: 1 label per solution (correct/incorrect). 100 labels total. Each label conveys: “Does the final answer match the ground truth?”
- Process supervision: 1 label per step per solution. If solutions average 5 steps, that’s 500 labels total. But each label conveys: “Is this specific step correct?” This is much richer information.
Wait: doesn’t process supervision require 5x more annotation effort? Not quite. An annotator who reads a solution reads every step anyway, so marking each step is not much more work than marking only the final answer. It’s still one human reading one solution, but that one read now yields five labels instead of one.
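The label arithmetic above can be sketched in a few lines. The function name `label_counts` is hypothetical, and the average of 5 steps per solution is the figure assumed in the text, not a measured value.

```python
from typing import Tuple

def label_counts(num_solutions: int, avg_steps: int) -> Tuple[int, int]:
    """Return (outcome_labels, process_labels) for a set of annotated solutions."""
    outcome_labels = num_solutions              # one label per solution
    process_labels = num_solutions * avg_steps  # one label per step
    return outcome_labels, process_labels

outcome_labels, process_labels = label_counts(num_solutions=100, avg_steps=5)

print(outcome_labels, process_labels)  # 100 500
```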
The core question: Given limited human annotation budget, is it better to annotate the outcome of 500 solutions, or the process (each step) of 100 solutions?
This paper’s answer: Process supervision on fewer solutions beats outcome supervision on more solutions. The step-level signal is so much richer that it more than compensates for covering fewer solutions.