Let’s walk through a complete example from start to finish.
The Problem
A geometry problem from the MATH benchmark:
A rectangle has width 4 and height 3. What is the length of the diagonal?
Ground truth: Using the Pythagorean theorem, $d = \sqrt{4^2 + 3^2} = \sqrt{16 + 9} = \sqrt{25} = 5$.
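The ground-truth value is easy to confirm numerically. The following is a minimal sketch using only Python's standard library; the variable names are illustrative and not part of the original problem statement.

```python
import math

width, height = 4, 3

# math.hypot(a, b) returns sqrt(a**2 + b**2), which is the diagonal length.
diagonal = math.hypot(width, height)
print(diagonal)  # 5.0
```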
Three Candidate Solutions Generated by GPT-4
The model is asked to solve this problem three times and produces three different solution attempts.
Solution 1: Clear and Correct
Step 1: A diagonal divides the rectangle into two right triangles.
Step 2: By Pythagorean theorem: d² = width² + height²
Step 3: d² = 4² + 3² = 16 + 9 = 25
Step 4: d = √25 = 5
Final Answer: 5
Solution 2: Correct but Verbose
Step 1: We need to find the diagonal of a rectangle with width=4 and height=3.
Step 2: The diagonal of a rectangle can be found using the Pythagorean theorem.
Step 3: Let me first check: is 4² + 3² = 16 + 9 = 25? Yes.
Step 4: So the diagonal is √25.
Step 5: Let me compute: √25 = 5. Let me verify: 5² = 25? Yes.
Step 6: Final answer is 5.
Final Answer: 5
Solution 3: Wrong Due to Arithmetic Error
Step 1: For a rectangle with width and height, the diagonal uses Pythagorean theorem.
Step 2: d² = 4² + 3² = 16 + 9 = 24 [ERROR: Should be 25]
Step 3: d = √24 ≈ 4.899
Step 4: Rounding to a nice number: d ≈ 5
Final Answer: 5 (approximately)
Human Annotation: Step-Level Labels
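A human annotator goes through each solution and marks each step as correct (✓) or incorrect (✗). The Confidence column gives the estimated probability that the step is correct, so an incorrect step receives a value near 0.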
Solution 1 Annotations
| Step | Content | Correct? | Confidence |
|---|---|---|---|
| 1 | Diagonal divides rectangle into right triangles | ✓ | 0.99 |
| 2 | Pythagorean theorem d² = w² + h² | ✓ | 0.98 |
| 3 | Arithmetic: 16 + 9 = 25 | ✓ | 0.99 |
| 4 | √25 = 5 | ✓ | 0.99 |
Human notes: “Clear, direct, mathematically rigorous. No errors.”
Solution 2 Annotations
| Step | Content | Correct? | Confidence |
|---|---|---|---|
| 1 | Problem setup | ✓ | 0.99 |
| 2 | Mentions Pythagorean theorem | ✓ | 0.98 |
| 3 | Verification: 16 + 9 = 25 | ✓ | 0.99 |
| 4 | d = √25 | ✓ | 0.99 |
| 5 | Verification: 5² = 25 | ✓ | 0.95 |
| 6 | Final answer stated | ✓ | 0.99 |
Human notes: “All correct, but verbose and repetitive. More steps than necessary.”
Solution 3 Annotations
| Step | Content | Correct? | Confidence |
|---|---|---|---|
| 1 | Sets up Pythagorean theorem correctly | ✓ | 0.98 |
| 2 | Arithmetic: 4² + 3² = 24 | ✗ | 0.02 |
| 3 | d = √24 ≈ 4.899 | ✗ | 0.05 |
| 4 | Rounds to 5 | ✗ | 0.10 |
Human notes: “Step 2 contains an arithmetic error (16 + 9 = 25, not 24). This cascades through the rest of the solution.”
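Step-level labels like these could be stored as simple records. The sketch below is one possible representation, not a standard format: the `StepLabel` class and the `first_error` helper are hypothetical names introduced for illustration. It encodes the Solution 3 annotations and locates the first incorrect step, which is where the cascade begins.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StepLabel:
    step: int          # 1-based step number
    correct: bool      # annotator's verdict for this step
    p_correct: float   # value from the Confidence column


# Annotations for Solution 3, copied from the table above.
solution_3 = [
    StepLabel(1, True, 0.98),
    StepLabel(2, False, 0.02),
    StepLabel(3, False, 0.05),
    StepLabel(4, False, 0.10),
]


def first_error(labels: List[StepLabel]) -> Optional[int]:
    """Return the number of the first incorrect step, or None if all are correct."""
    for label in labels:
        if not label.correct:
            return label.step
    return None


print(first_error(solution_3))  # 2
```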
Scoring with Outcome Reward Model (ORM)
The ORM looks only at the final answer and compares it with the ground truth.
| Solution | Final Answer | Ground Truth | ORM Score |
|---|---|---|---|
| 1 | 5 | 5 | 1 ✓ |
| 2 | 5 | 5 | 1 ✓ |
| 3 | 5 (approx) | 5 | 1 ✓ |
ORM verdict: All three solutions are equally good. ORM score = 1 for all.
Critique: The ORM cannot distinguish among Solution 1 (clean, rigorous), Solution 2 (verbose), and Solution 3 (contains an arithmetic error, then rounds √24 ≈ 4.899 up to the correct answer). From the ORM’s perspective, all three are identical.
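The outcome-only scoring above can be sketched in a few lines. This is a simplified stand-in for an ORM (a real one is a learned model, not an exact-match check); the `orm_score` function and its tolerance parameter are assumptions made for illustration, and Solution 3's "5 (approximately)" is assumed to parse to 5, as in the table.

```python
def orm_score(final_answer: float, ground_truth: float, tol: float = 1e-9) -> int:
    """Return 1 if the final answer matches the ground truth, else 0."""
    return int(abs(final_answer - ground_truth) <= tol)


GROUND_TRUTH = 5.0

# Parsed final answers; the reasoning steps are never inspected.
final_answers = {"Solution 1": 5.0, "Solution 2": 5.0, "Solution 3": 5.0}

orm_scores = {name: orm_score(ans, GROUND_TRUTH) for name, ans in final_answers.items()}
print(orm_scores)  # {'Solution 1': 1, 'Solution 2': 1, 'Solution 3': 1}
```

Because the function receives only the final answer, nothing about Solution 3's faulty Step 2 can influence its score.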
Scoring with Process Reward Model (PRM)
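The PRM looks at each step and multiplies the per-step correctness probabilities: for a solution with $n$ steps, $R_{\text{PRM}} = \prod_{i=1}^{n} p_i$. In this example, each $p_i$ is the Confidence value from the annotation tables above.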
Solution 1: Product Score
Per-step probabilities: $p_1 = 0.99, p_2 = 0.98, p_3 = 0.99, p_4 = 0.99$
$$R_{\text{PRM}}(\text{Solution 1}) = 0.99 \times 0.98 \times 0.99 \times 0.99 \approx 0.951$$
Solution 2: Product Score
Per-step probabilities: $p_1 = 0.99, p_2 = 0.98, p_3 = 0.99, p_4 = 0.99, p_5 = 0.95, p_6 = 0.99$
$$R_{\text{PRM}}(\text{Solution 2}) = 0.99 \times 0.98 \times 0.99 \times 0.99 \times 0.95 \times 0.99 \approx 0.894$$
Note: Solution 2’s score is lower than Solution 1’s because it has more steps (6 vs. 4). Every per-step probability is below 1, so each additional factor shrinks the product even when the step is correct. Here the extra verification in Step 5 ($p_5 = 0.95$) and the restated answer in Step 6 ($p_6 = 0.99$) together multiply the score by about 0.94.
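The length effect can be isolated by holding the per-step probability fixed and varying only the number of steps. The sketch below assumes a hypothetical uniform probability of 0.99 per step (not a value from the annotations) and uses an illustrative `product_score` helper.

```python
import math


def product_score(step_probs):
    """Product of per-step correctness probabilities."""
    return math.prod(step_probs)


# Same per-step quality, different lengths.
four_steps = product_score([0.99] * 4)
six_steps = product_score([0.99] * 6)

print(round(four_steps, 3))  # 0.961
print(round(six_steps, 3))   # 0.941
```

Even with identical per-step quality, the six-step solution scores lower, which is the length penalty built into product scoring.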
Solution 3: Product Score
Per-step probabilities: $p_1 = 0.98, p_2 = 0.02, p_3 = 0.05, p_4 = 0.10$
$$R_{\text{PRM}}(\text{Solution 3}) = 0.98 \times 0.02 \times 0.05 \times 0.10 = 0.000098$$
The score is tiny because Step 2 is marked incorrect ($p_2 = 0.02$): a single near-zero factor collapses the entire product, and the low probabilities of the cascaded Steps 3 and 4 shrink it further.
Summary: PRM Scores
| Solution | ORM Score | PRM Score | Ranking |
|---|---|---|---|
| Solution 1 | 1.000 | 0.951 | 1st (best) |
| Solution 2 | 1.000 | 0.894 | 2nd |
| Solution 3 | 1.000 | 0.000098 | 3rd (worst) |
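The PRM column can be reproduced directly from the per-step probabilities listed above. This is a minimal sketch of product scoring only, assuming the probabilities are given; it does not model how a trained PRM would produce them. The dictionary layout is an illustrative choice.

```python
import math

# Per-step correctness probabilities, copied from the sections above.
step_probs = {
    "Solution 1": [0.99, 0.98, 0.99, 0.99],
    "Solution 2": [0.99, 0.98, 0.99, 0.99, 0.95, 0.99],
    "Solution 3": [0.98, 0.02, 0.05, 0.10],
}

# Product score for each solution.
prm_scores = {name: math.prod(probs) for name, probs in step_probs.items()}

# Rank solutions from highest to lowest PRM score.
ranking = sorted(prm_scores, key=prm_scores.get, reverse=True)

for name in ranking:
    print(f"{name}: {prm_scores[name]:.6f}")
```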
Best-of-N Selection Results
Now suppose we use each reward model to pick the best solution from these 3 candidates.
With ORM:
Scores: [1, 1, 1]
Best solution: Solution 1 (tie-break: first in list)
Verdict: ORM picked a correct solution, but only through the tie-break; it scores Solutions 2 and 3 exactly the same.
ORM got lucky. It cannot explain why Solution 1 is better.
With PRM:
Scores: [0.951, 0.894, 0.000098]
Best solution: Solution 1
Verdict: PRM clearly ranks Solution 1 as best. It also distinguishes
Solution 2 (verbose but correct) from Solution 3 (contains an error).
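The selection step itself can be sketched as an argmax with a first-in-list tie-break, matching the tie-break rule stated above. The helper names `best_of_n` and `num_tied_for_best` are hypothetical; the PRM scores are recomputed from the per-step probabilities rather than hard-coded.

```python
import math


def best_of_n(scores):
    """Index of the highest score; ties resolve to the earliest candidate."""
    return max(range(len(scores)), key=lambda i: scores[i])


def num_tied_for_best(scores):
    """How many candidates share the top score."""
    top = max(scores)
    return sum(1 for s in scores if s == top)


orm_scores = [1, 1, 1]
prm_scores = [
    math.prod([0.99, 0.98, 0.99, 0.99]),
    math.prod([0.99, 0.98, 0.99, 0.99, 0.95, 0.99]),
    math.prod([0.98, 0.02, 0.05, 0.10]),
]

# Index 0 corresponds to Solution 1.
print(best_of_n(orm_scores), num_tied_for_best(orm_scores))  # 0 3
print(best_of_n(prm_scores), num_tied_for_best(prm_scores))  # 0 1
```

Both reward models return index 0, but the ORM result is a three-way tie decided by list order, while the PRM result has a unique maximum.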
The Deeper Insight
In this toy example, ORM and PRM both happen to pick Solution 1. But PRM’s choice is justified: Solution 1 is genuinely the best (cleanest reasoning, no errors, no verbosity). PRM “knows” this because it looked at the steps.
ORM’s choice is unjustified: ORM cannot explain why Solution 1 is better than Solutions 2 or 3; it only knows they all have the right final answer.
Now scale this up: imagine 100 solutions. Many will share the correct final answer, so ORM will produce many ties, while PRM will rank them by the quality of their reasoning. This finer-grained signal is what makes PRM more useful for training and for selecting among candidates.
Verification: Manual Check of All Arithmetic
Solution 1, Step 3:
- 4² = 4 × 4 = 16 ✓
- 3² = 3 × 3 = 9 ✓
- 16 + 9 = 25 ✓
Solution 2, Step 3:
- Same as above: 16 + 9 = 25 ✓
Solution 3, Step 2 (ERROR):
- 4² = 16 ✓
- 3² = 9 ✓
- 16 + 9 = 25, NOT 24 ✗
Solution 3, Step 3:
- √24 ≈ 4.899 ✓ (given the wrong input 24)
- But the input is wrong, so this step is marked incorrect
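The same checks can be run mechanically. A short sketch using only the standard library; `claimed_sum` is an illustrative name for the value written in Solution 3, Step 2.

```python
import math

width_sq = 4 ** 2
height_sq = 3 ** 2
true_sum = width_sq + height_sq
claimed_sum = 24  # value written in Solution 3, Step 2

print(width_sq, height_sq, true_sum)      # 16 9 25
print(claimed_sum == true_sum)            # False
print(round(math.sqrt(claimed_sum), 3))   # 4.899
print(math.isqrt(true_sum))               # 5
```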