Limitations: What This Paper Cannot Do — Let's Verify Step by Step: A Process Supervision Approach to Reward Modeling

Process supervision is powerful, but it has real constraints. Here are the major limitations:

1. Step-Level Annotation Is Expensive

The bottleneck: To train a PRM, you need human annotators to go through every step of every solution. The PRM800K dataset required 800,000 step-level labels — a massive annotation effort.

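The implication (timings are rough estimates):

ORM: Mark the final answer correct/incorrect (~30 seconds per solution by hand; when the answer can be checked automatically, no annotator is needed)
PRM: Read through the solution, mark each step (~5-10 minutes per solution)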

If you want to scale RLHF to more domains (medical reasoning, legal analysis, coding), the PRM annotation cost becomes prohibitive. You cannot easily scale process supervision to problems with 20-step solutions or complex multi-part reasoning.
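To make the gap concrete, here is a back-of-the-envelope calculation. It takes the per-solution timings quoted above at face value; the batch size of 1,000 solutions is a made-up number chosen only for illustration.

```python
# Back-of-the-envelope annotation cost, using the per-solution timings
# quoted above. The batch size is a hypothetical value for illustration.

ORM_SECONDS = 30                 # ~30 s to check a final answer
PRM_SECONDS = (5 * 60, 10 * 60)  # ~5-10 min to label every step


def annotation_hours(num_solutions: int, seconds_per_solution: float) -> float:
    """Total labeling time in hours for a batch of solutions."""
    return num_solutions * seconds_per_solution / 3600


num_solutions = 1_000  # assumed batch size
orm_hours = annotation_hours(num_solutions, ORM_SECONDS)
prm_low, prm_high = (annotation_hours(num_solutions, s) for s in PRM_SECONDS)

print(f"ORM: {orm_hours:.1f} h")
print(f"PRM: {prm_low:.1f} to {prm_high:.1f} h")
print(f"Ratio: {prm_low / orm_hours:.0f}x to {prm_high / orm_hours:.0f}x")
```

Under these assumptions the step-level pass costs roughly 10 to 20 times as much annotator time as the outcome-only pass, and the ratio grows with solution length.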

Potential mitigation: Automated step-level feedback (e.g., using a stronger model to annotate steps for a weaker model). But the stronger model's labels are themselves imperfect, so its mistakes carry over into the PRM's training data.

2. Generalization Beyond Mathematical Reasoning Is Unknown

The constraint: The paper validates process supervision only on mathematical problems from the MATH benchmark. These problems have:

Clear, verifiable reasoning steps
Objective ground truth (an answer is right or wrong)
Well-defined mathematical notation
Compact solutions (typically 3-10 steps)

Unclear domains:

Medical reasoning: “Step 1: Patient has symptom X. Step 2: This could be disease A or B. Step 3: We should run test Z.” How do you mark these steps as correct? There’s no single ground truth.
Creative writing: “Step 1: Write an opening sentence. Step 2: Develop the plot.” These are not binary correct/incorrect.
Legal analysis: Complex reasoning about precedent and interpretation. Hard to label individual steps.
General coding: Is “initialize a variable” a correct step? Depends on the rest of the code.

The risk: Process supervision may turn out to be specific to mathematical reasoning. It may not transfer to domains where reasoning is more ambiguous or subjective.

3. The Product Score May Be Overly Conservative

The issue: Using $R_{\text{PRM}} = \prod_t p_t$, a single bad step tanks the entire score.

Example: A solution with steps [0.99, 0.99, 0.99, 0.02, 0.99, 0.99]: $$R_{\text{PRM}} = 0.99^5 \times 0.02 \approx 0.0190$$

Even though 5 out of 6 steps are excellent, the product is tiny because of one step with low confidence.
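The arithmetic is easy to check directly. The sketch below computes the product score for the step probabilities in the example; the function names are illustrative, and the log-space variant is an assumption about how one might avoid underflow on long solutions, not something taken from the paper.

```python
import math

# Per-step correctness probabilities from the example above.
step_probs = [0.99, 0.99, 0.99, 0.02, 0.99, 0.99]


def product_score(probs):
    """Product of the per-step correctness probabilities."""
    return math.prod(probs)


def log_product_score(probs):
    """Same score in log space, which avoids underflow on long solutions."""
    return sum(math.log(p) for p in probs)


score = product_score(step_probs)
print(f"product score: {score:.4f}")  # about 0.0190
print(f"without the weak step: {product_score([0.99] * 5):.4f}")  # about 0.9510
```

The five strong steps alone would give a score of about 0.95, so nearly all of the drop comes from the single 0.02 step.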

The problem: If the low-confidence step is a minor typo (e.g., “2.0001” instead of “2”) or a slightly informal phrasing (e.g., “approximately 3” instead of “3 + ε”), the entire solution gets a low score despite being nearly correct.

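Alternative (minimum score): Using $R_{\text{PRM}} = \min_t p_t$ does not fix this (the example above would still score 0.02), but it does remove the product's length penalty. Under the product, a solution with many steps scores lower than a short one with the same per-step confidence, simply because more factors below 1 are multiplied together. The minimum ignores everything except the weakest step, so it cannot tell one doubtful step from several.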

Current practice: The paper uses the product as its main scoring rule, interpreting it as the probability that every step is correct, and also reports results with the minimum.
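A quick way to see how each aggregation rule reacts to solution length is to run both on toy inputs. The probabilities below are made up for illustration: every step gets the same score, and only the number of steps differs.

```python
import math


def product_score(probs):
    """Product of the per-step probabilities."""
    return math.prod(probs)


def min_score(probs):
    """Probability of the weakest step."""
    return min(probs)


# Two hypothetical solutions whose steps all get the same PRM probability.
short_solution = [0.95] * 2
long_solution = [0.95] * 10

for name, probs in [("short", short_solution), ("long", long_solution)]:
    print(f"{name}: product={product_score(probs):.3f}, min={min_score(probs):.3f}")
```

With these inputs the product falls from roughly 0.90 to roughly 0.60 as the solution grows from 2 to 10 steps, while the minimum stays at 0.95 in both cases.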

4. Defining “Steps” Is Ambiguous

The ambiguity: What counts as a step?

Consider solving $x^2 - 1 = 0$:

Option A (coarse steps):

Step 1: Factor as (x-1)(x+1) = 0
Step 2: So x = 1 or x = -1

Option B (fine-grained steps):

Step 1: Recognize x² - 1 as a difference of squares
Step 2: Recall that a² - b² = (a-b)(a+b)
Step 3: Apply with a=x, b=1: (x-1)(x+1) = 0
Step 4: Set first factor to zero: x - 1 = 0, so x = 1
Step 5: Set second factor to zero: x + 1 = 0, so x = -1

The same solution decomposes into 2 steps or 5 steps depending on granularity.

Why this matters: If you use coarse steps, you lose information. If you use fine-grained steps, annotation becomes tedious and inconsistent (different annotators draw the line differently).

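Current practice: The paper sidesteps the question by training the generator to write newline-delimited solutions and treating each line as a step. This works for the MATH benchmark, but it leaves granularity up to the generator's formatting rather than to a principled definition of a step.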
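To see how much the step count depends on the segmentation rule, the sketch below applies one simple, assumed convention (each non-empty line is a step) to the two decompositions above. `split_steps` is a hypothetical helper written for this illustration.

```python
def split_steps(solution: str) -> list[str]:
    """Treat each non-empty line as one step (an assumed convention)."""
    return [line.strip() for line in solution.splitlines() if line.strip()]


coarse = """Factor as (x-1)(x+1) = 0
So x = 1 or x = -1"""

fine = """Recognize x^2 - 1 as a difference of squares
Recall that a^2 - b^2 = (a-b)(a+b)
Apply with a=x, b=1: (x-1)(x+1) = 0
Set first factor to zero: x - 1 = 0, so x = 1
Set second factor to zero: x + 1 = 0, so x = -1"""

print(len(split_steps(coarse)), len(split_steps(fine)))  # 2 5
```

The rule itself is trivial. What it cannot do is decide how much reasoning belongs on one line, which is where the ambiguity lives.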

5. Credit Assignment Is Hard When Steps Are Interdependent

The issue: A PRM labels steps one at a time, marking step i correct or incorrect given steps 1 through i-1. That works best when each step can be judged largely on its own.

But in some problems, steps are deeply interdependent:

Example: Solving a system of equations

Step 1: 2x + y = 5
Step 2: x - y = 1
Step 3: Add equations: 3x = 6, so x = 2
Step 4: Substitute back: 2(2) + y = 5, so y = 1

Suppose step 3 contained an error (say, it concluded x = 3). Is step 4 then wrong too? The substitution logic in step 4 would still be valid, but it would be operating on a wrong value of x and so would produce a wrong y.

The human annotator must decide: “Is step 4 correct logic, even though the input is wrong?” The answer is ambiguous.

The practical issue: When steps are interdependent, per-step correctness is not well-defined. This makes training a PRM harder in domains with complex dependencies (e.g., theorem proving, constraint satisfaction).
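One possible way to sidestep the question is to stop labeling at the first incorrect step, so that later steps depending on it are never judged. The sketch below illustrates that convention; the function names and the example labels are hypothetical.

```python
from typing import Optional


def first_error_index(step_is_correct: list[bool]) -> Optional[int]:
    """Index of the first incorrect step, or None if all steps are correct."""
    for i, ok in enumerate(step_is_correct):
        if not ok:
            return i
    return None


def truncate_at_first_error(step_is_correct: list[bool]) -> list[bool]:
    """Keep labels up to and including the first incorrect step."""
    idx = first_error_index(step_is_correct)
    return step_is_correct if idx is None else step_is_correct[: idx + 1]


# Hypothetical labels for the four-step example: step 3 is marked wrong.
labels = [True, True, False, True]
print(truncate_at_first_error(labels))  # [True, True, False]
```

This trades information for consistency: the label for step 4 is discarded rather than argued over.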

6. PRM Requires Ground Truth for Training

The requirement: To train a PRM, you need to know the correct answer and be able to verify each step. This works for math problems (you can compute the answer), but it’s harder for open-ended tasks.

Example: If a task is “write a poem about spring,” there’s no ground truth. You cannot train a PRM to judge if each step is correct, because there is no objective correctness.

The implication: PRM-style supervision is limited to domains where correctness is verifiable — math, code with test cases, factual Q&A, etc. It cannot easily scale to creative tasks or domains with subjective quality judgments.
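In practice, "verifiable" usually means that a program can check the final answer. As a minimal sketch, assuming answers are plain numbers written as strings, an exact comparison might look like this. `answers_match` is a hypothetical helper; real math answers (expressions, intervals, units) would need much more careful normalization.

```python
from fractions import Fraction


def answers_match(predicted: str, reference: str) -> bool:
    """Compare two numeric answers exactly, so '0.5' and '1/2' are equal."""
    try:
        return Fraction(predicted.strip()) == Fraction(reference.strip())
    except (ValueError, ZeroDivisionError):
        # Fall back to string comparison for non-numeric answers.
        return predicted.strip() == reference.strip()


print(answers_match("0.5", "1/2"))  # True
print(answers_match("3", "4"))      # False
```

No comparable check exists for "write a poem about spring", which is the gap this limitation describes.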

7. The MATH Benchmark Itself Is Small

The data: PRM800K contains about 800,000 step-level labels across roughly 75,000 solutions to 12,000 MATH problems, and the paper evaluates on a held-out set of 500 test problems. This is large by human annotation standards, but small by modern LLM training standards.

Questions:

Do the results generalize to other math problem distributions?
Would a PRM trained on MATH generalize to high school algebra, or only to competition math?
How does performance scale with more annotation data (1M, 10M step-level labels)?

The paper addresses this in part with an out-of-distribution evaluation on recent AP and AMC exam questions, but more diversity in problem sources would strengthen the claims.

8. The ORM Is Not State-of-the-Art

Context: The paper compares PRM to ORM (outcome reward model), but ORM itself is a simple baseline. More sophisticated baselines might exist:

Ensemble methods: Combine multiple ORM predictions
Fine-grained outcomes: Instead of binary correct/incorrect, score the final answer on a scale (e.g., 1-10)
Hybrid approaches: ORM + some per-step feedback without full step annotations

The paper does not deeply explore these alternatives, so it’s unclear how much of the PRM's advantage would survive against a stronger outcome-based baseline.
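To illustrate what the first of these baselines could look like, here is a minimal sketch of an ORM ensemble used for best-of-N selection. Everything in it is an assumption made for illustration: the `Scorer` type, the averaging rule, and the toy lambdas standing in for trained reward models.

```python
from statistics import mean
from typing import Callable, Sequence

Scorer = Callable[[str], float]  # maps a solution string to a score in [0, 1]


def ensemble_score(solution: str, scorers: Sequence[Scorer]) -> float:
    """Average the scores of several outcome reward models."""
    return mean(scorer(solution) for scorer in scorers)


def best_of_n(candidates: Sequence[str], scorers: Sequence[Scorer]) -> str:
    """Return the candidate with the highest ensemble score."""
    return max(candidates, key=lambda s: ensemble_score(s, scorers))


# Toy stand-ins for trained ORMs; real models would replace these lambdas.
toy_scorers = [
    lambda s: 0.9 if "x = 2" in s else 0.2,
    lambda s: 0.8 if "x = 2" in s else 0.4,
]
candidates = ["... so x = 3", "... so x = 2"]
print(best_of_n(candidates, toy_scorers))
```

Whether such an ensemble would narrow the gap to a PRM is an empirical question that this sketch does not answer.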

Summary

Process supervision is a significant step forward for mathematical reasoning, but it is not a universal solution. Its applicability depends on:

Whether step-level annotation is feasible (expensive, requires ground truth)
Whether the domain has verifiable step-level correctness (math: yes; poetry: no)
Whether steps are well-defined and relatively independent
Whether the problem space is well-defined and benchmarkable

For mathematical reasoning, these conditions hold. For broader domains, they may not.