Limitations: When Self-Evolution Breaks Down — rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking

rStar-Math is impressive, but it rests on specific assumptions. Here’s what it doesn’t do well.

1. Requires Automatic Verification

The entire approach depends on being able to check solutions automatically. For math, this is cheap: run the generated Python code and compare its output to the known answer.
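A minimal sketch of that kind of check, assuming each candidate solution is a Python snippet that stores its result in a variable called `answer`. The function name `verify_candidate` and the `answer` convention are illustrative assumptions, not the paper's actual interface.

```python
def verify_candidate(code: str, expected: float, tol: float = 1e-9) -> bool:
    """Run a candidate snippet and compare its `answer` variable to the reference."""
    namespace: dict = {}
    try:
        # NOTE: exec on untrusted model output is unsafe. A real pipeline
        # would sandbox this step and enforce time and memory limits.
        exec(code, namespace)
    except Exception:
        return False  # code that crashes counts as a failed solution
    answer = namespace.get("answer")
    if not isinstance(answer, (int, float)):
        return False  # missing or non-numeric answer
    return abs(answer - expected) <= tol


# Toy problem: how many integers in 1..100 are multiples of 3 but not of 15?
good = "answer = sum(1 for n in range(1, 101) if n % 3 == 0 and n % 15 != 0)"
bad = "answer = sum(1 for n in range(1, 101) if n % 3 == 0)"

print(verify_candidate(good, 27))  # True
print(verify_candidate(bad, 27))   # False (the snippet computes 33)
```

The check needs nothing but a reference answer and an interpreter, which is exactly what the domains listed next do not have.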

But for most real-world problems, this breaks:

Medical diagnosis: A treatment plan might help the patient, but you won’t know for months. You can’t auto-verify.
Legal interpretation: Is a contract clause interpreted correctly? Requires human lawyers to judge.
Creative writing: Is the story good? Subjective. No auto-verifier.
Scientific research: Is the hypothesis interesting? Requires peer review.

For these domains, you’d need human annotators to label solutions, bringing back the cost and latency the paper was trying to avoid.

rStar-Math works in a narrow domain: problems with objectively verifiable solutions (math, code, formal logic). Outside this domain, the bootstrapping loop breaks.

2. Assumes Solutions are Decomposable into Steps

The Process Reward Model scores individual steps. This assumes solutions naturally decompose into a sequence of steps, each of which can be judged.

For math, this is true: “count multiples of 3” → “subtract multiples of 15” → “output answer.” Clear steps.

But for other tasks:

Ambiguous decomposition: In essay writing, there’s no clear “Step 1, Step 2, Step 3.” Ideas flow and overlap.
Interdependent steps: In a chess game, each move depends on all previous moves; isolating one step loses context.
Emergent properties: In creative work, the magic is often in how disparate elements combine, not in individual steps.

For these tasks, step-level PRM training becomes hard. You might only be able to score the final output (Outcome Reward Model, ORM), which is less informative and makes MCTS guidance weaker.
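To make the difference concrete, here is a toy sketch of why step-level scores guide search better than outcome-only scores. Both scorers (`toy_prm`, `toy_orm`) are hypothetical stand-ins written for this illustration; a real PRM is a trained model, not a string check.

```python
from typing import Callable, List

Scorer = Callable[[List[str]], float]


def best_next_step(prefix: List[str], candidates: List[str], scorer: Scorer) -> str:
    """Return the candidate whose extended partial path scores highest."""
    return max(candidates, key=lambda step: scorer(prefix + [step]))


def toy_prm(path: List[str]) -> float:
    # Hypothetical step-level scorer: it can judge any partial path.
    return 0.0 if any(step.startswith("WRONG") for step in path) else 1.0


def toy_orm(path: List[str]) -> float:
    # Hypothetical outcome-only scorer: uninformative until a final answer exists.
    if not path or not path[-1].startswith("answer"):
        return 0.5
    return toy_prm(path)


candidates = ["WRONG: count multiples of 5", "count multiples of 3"]

print(best_next_step([], candidates, toy_prm))  # count multiples of 3
print([toy_orm([c]) for c in candidates])       # [0.5, 0.5]
```

With the outcome-only scorer, every incomplete path looks the same, so the search cannot prune a bad first step until a full solution has been generated.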

3. Doesn’t Handle Missing Knowledge

Test-time compute and self-evolution help with reasoning, but not recall.

If the model doesn’t know that “the square root of 2 is approximately 1.414,” more thinking time won’t add that knowledge. You’d still get the question wrong.

Example:

Reasoning problem: “Prove that sqrt(2) is irrational.” Even 42%-accuracy models can make good attempts. With MCTS + training, they improve to 90%.
Recall problem: “What is the capital of Mongolia?” Either the model knows (Ulaanbaatar) or it doesn’t. More thinking doesn’t help.

rStar-Math is strongest on problems mixing reasoning and recall. Pure-recall benchmarks (factual QA, closed-book exams) see less benefit.

4. PRM Quality Bootstrapping Problem

Round 1 must use an initial PRM, trained on limited data or a heuristic. If this PRM is poor, MCTS selects poor solutions, and training degrades.

Example: If PRM₁ incorrectly scores 20% of wrong solutions as correct, Round 1 training data gets polluted with bad examples. A model trained on this data can end up worse than the one that generated it.
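How bad the pollution gets depends on how often the policy is right to begin with. The arithmetic can be sketched as follows; the policy accuracies used in the example calls are hypothetical, and the function assumes the PRM acts as a simple accept/reject filter, which is a simplification of how scores are actually used in search.

```python
def polluted_fraction(policy_acc: float, false_pos_rate: float,
                      false_neg_rate: float = 0.0) -> float:
    """Fraction of accepted solutions that are actually wrong.

    policy_acc:     share of generated solutions that are correct
    false_pos_rate: share of wrong solutions the PRM accepts
    false_neg_rate: share of correct solutions the PRM rejects
    """
    correct_kept = policy_acc * (1.0 - false_neg_rate)
    wrong_kept = (1.0 - policy_acc) * false_pos_rate
    kept = correct_kept + wrong_kept
    if kept == 0:
        raise ValueError("the filter accepts nothing")
    return wrong_kept / kept


# Same 20% false-positive rate, two hypothetical policy accuracies.
print(round(polluted_fraction(0.5, 0.2), 3))  # 0.167
print(round(polluted_fraction(0.2, 0.2), 3))  # 0.444
```

The same PRM error rate hurts far more when the policy is weak, because wrong solutions make up a larger share of what the PRM sees.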

The question: How do you bootstrap a PRM on a task where ground truth is expensive?

For math, you can use code execution (ground truth is free). For other domains, you might not have such a check. You might need:

Human annotation to train initial PRM (expensive)
Or a weaker heuristic PRM that’s only 70% accurate (limits quality of early rounds)

5. Computational Cost of MCTS Search

Running MCTS on 12,500 problems is expensive. Each problem:

Generate candidate solutions
Score with PRM
Run Python verification
Update search tree

Multiply that by several rounds. rStar-Math’s results depend on significant GPU compute, which puts the method out of reach for labs without large compute budgets.
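A back-of-envelope sketch of how the per-problem loop multiplies out. Only the 12,500 problems and the 4 rounds come from this article; the rollout, step, and candidate counts below are hypothetical placeholders chosen to show the scaling, not the paper's settings.

```python
def mcts_generation_calls(num_problems: int, rollouts_per_problem: int,
                          steps_per_rollout: int, candidates_per_step: int,
                          rounds: int) -> int:
    """Rough count of candidate-step generations across all rounds.

    Each generated candidate also needs a PRM score and a Python
    verification run, so those costs scale with the same number.
    """
    per_problem = rollouts_per_problem * steps_per_rollout * candidates_per_step
    return num_problems * per_problem * rounds


calls = mcts_generation_calls(
    num_problems=12_500,
    rollouts_per_problem=16,  # hypothetical
    steps_per_rollout=8,      # hypothetical
    candidates_per_step=8,    # hypothetical
    rounds=4,
)
print(f"{calls:,}")  # 51,200,000
```

Even with modest per-problem settings, the total lands in the tens of millions of generations, each paired with a scoring and verification call.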

For practitioners without such resources, simpler approaches (rejection sampling, synthetic data) might be more practical, even if they achieve slightly lower accuracy.

6. Architectural Limits of 7B Parameters

Even with optimal training data, a 7B model has limits. rStar-Math reaches 90% on MATH, but:

Some AIME problems are genuinely novel (require insights humans rarely have)
A 7B model can’t store and reason with arbitrarily complex symbolic systems
Physical intuition, visual reasoning, etc., might require larger models

Larger models (70B, 100B+) might have higher ceilings. The question: would self-evolution on a 70B model reach 95%? 98%? There may be fundamental limits.

7. Narrow Task Domain

The paper’s strongest results are on MATH — a specific benchmark. Generalization to other domains is unclear:

Code generation: Does self-evolution work? Code has auto-verification (tests), but solutions are diverse; step-level scoring is less clear.
Reasoning across domains: Science problems, logic puzzles, multi-step planning. Do PRMs trained on MATH transfer to these?
Real-world problems: Most real problems are open-ended, not like MATH. The assumption of binary correctness breaks.

The paper demonstrates the approach on one domain. Whether it generalizes is an open question.

8. Diminishing Returns and Saturation

The improvement per round shrinks: 26% → 10% → 7% → 5%. By Round 5, you might see <2% improvement. The compute cost per percentage point rises steeply with each round.

At what point does the cost of additional rounds exceed the benefit? For MATH, the paper chose 4 rounds. But for other tasks with different saturation curves, this might be suboptimal.
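The trade-off can be put in numbers. The sketch below uses the per-round gains quoted above and assumes, purely for illustration, that every round costs the same amount of compute (one unit). The stopping threshold is a hypothetical budget, not something the paper specifies.

```python
from typing import List

gains = [26, 10, 7, 5]  # per-round improvements quoted above, in points


def cost_per_point(gains: List[float], cost_per_round: float = 1.0) -> List[float]:
    """Compute units spent per point gained in each round (constant round cost assumed)."""
    return [cost_per_round / g for g in gains]


def rounds_worth_running(gains: List[float], max_cost_per_point: float,
                         cost_per_round: float = 1.0) -> int:
    """Count rounds until the marginal cost per point first exceeds the budget."""
    count = 0
    for cost in cost_per_point(gains, cost_per_round):
        if cost > max_cost_per_point:
            break
        count += 1
    return count


print([round(c, 3) for c in cost_per_point(gains)])  # [0.038, 0.1, 0.143, 0.2]
print(rounds_worth_running(gains, max_cost_per_point=0.15))  # 3
```

Under the constant-cost assumption, a point gained in Round 4 costs about five times as much as one gained in Round 1. If later rounds are more expensive to run, the gap widens further.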

There’s also a hard ceiling imposed by:

The base model’s architecture (can it reason at this level of complexity?)
The dataset’s difficulty (are there problems the model fundamentally can’t solve?)

rStar-Math doesn’t magically break these ceilings; it just gets you closer to them.

9. Reproducibility and Baseline Sensitivity

The paper compares to base Qwen-7B. But:

What if you start from a better pre-trained base (e.g., a model already fine-tuned on code)?
What if you start from a weaker base?

The delta (improvement per round) might vary. The paper doesn’t thoroughly ablate the sensitivity to starting conditions.

Also, running MCTS with different random seeds, different PRM initializations, etc., might produce different results. The paper should report variance; most papers don’t.
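Reporting that variance is cheap once the runs exist. A minimal sketch using Python's standard `statistics` module follows; the accuracy values passed in are made-up placeholders to show the call, not results from the paper.

```python
import statistics
from typing import List, Tuple


def summarize_runs(accuracies: List[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) of accuracy across seeds."""
    mean = statistics.mean(accuracies)
    # Sample standard deviation needs at least two runs.
    std = statistics.stdev(accuracies) if len(accuracies) > 1 else 0.0
    return mean, std


# Placeholder values only: one accuracy per random seed.
placeholder_runs = [0.88, 0.90, 0.92]
mean, std = summarize_runs(placeholder_runs)
print(f"{mean:.3f} ± {std:.3f}")  # 0.900 ± 0.020
```

A per-round gain smaller than the spread across seeds is hard to distinguish from noise, which is why the spread matters most in the later rounds.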

The Bigger Picture

These limitations point to a deeper insight: rStar-Math works because math is special.

Math has:

Objective correctness (right or wrong, no ambiguity)
Automatic verification (Python execution)
Step-wise decomposability (reasoning naturally breaks into steps)
Clear ground truth (every problem has a known reference answer)

Most real-world problems lack one or more of these. The approach doesn’t magically generalize.

The valuable lesson: When these conditions hold, self-evolution is powerful. When they don’t, you need different strategies. rStar-Math is a proof-of-concept for a specific domain, not a universal solution.