Section 03

The Idea: Self-Evolution Through Search and Training

rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking 2025

The Four-Round Loop

rStar-Math works through four consecutive rounds, each following the same pattern:

  1. Search: Use MCTS guided by a PRM to generate high-quality solutions
  2. Verify: Run solutions as Python code — automatic, no human judgment
  3. Select: Keep only correct solutions with high PRM scores
  4. Train: Supervised fine-tune the model on this selected data
  5. Improve: Also train a better PRM for the next round
  6. Repeat: The improved model generates better data, which trains an even better model

Let’s walk through each round carefully.


Round 1: Starting from Weakness

Input: Base Qwen-7B model, ~42% accuracy on MATH.

MCTS Search: For each problem in the MATH dataset:

  • Initialize an empty solution (root node of the search tree)
  • Repeatedly:
    • Selection: Use UCB to pick which partial solution to expand next
      • Promising branches (high PRM score, few visits) get explored
      • Poor branches (low PRM score, many visits) are abandoned
    • Expansion: Ask the model: “What’s the next reasoning step?”
    • Simulation: Complete the solution (either through rollout or direct execution)
    • Evaluation: Run the Python code. Did we get the right answer? (+1 reward) or wrong? (0 reward)
    • Backup: Update all nodes on the path with the outcome
  • Repeat until convergence (the tree explores many branches and settles on good ones)

Example: For a divisibility problem, the MCTS tree explores:

  • Branch A: “Count multiples of 3 using range(3, 100, 3)”
  • Branch B: “Count using list comprehension”
  • Branch C: “Use the formula floor(99/3)”

The PRM scores each, and MCTS focuses on the most promising branches.

Python Verification: Once MCTS finds a candidate solution (a complete Python program), run it:

exec(solution_code)
if answer == 27:  # correct answer
    label = "CORRECT"

No human judgment. Just execution.

Data Selection: Collect all solutions where:

  • Python execution returned the correct answer
  • PRM scores on intermediate steps were high (model reasoned well)

This gives us, say, 8,000 high-quality solutions (out of 12,500 problems).

Supervised Fine-Tuning: Train Qwen-7B on these 8,000 solutions: $$L_{\text{SFT}} = -\frac{1}{T} \sum_{t} \log P(\text{token}t | \text{tokens}{<t})$$

Only the tokens from correct, high-quality solutions contribute to the loss.

PRM₂ Training: Also train a new PRM₂ on step-level annotations from the MCTS data. PRM₁ was basic; PRM₂ is better because trained on more and better data.

Result: ~68% accuracy on MATH. A 26 percentage point jump from 42%!


Round 2: Bootstrapping Improvement

Input: Improved Qwen-7B (now 68% accurate), PRM₂ (better at scoring steps).

The Loop Repeats: Run MCTS again, but now:

  • The model generates better candidate solutions (because it’s better)
  • The PRM guides the search more accurately (because it’s better)
  • Together, they find even higher-quality solutions

Result: ~78% accuracy. +10 percentage points.


Round 3 & 4: Cascading Gains

The same pattern continues:

  • Round 3: 78% → 85%
  • Round 4: 85% → 90% (matches o1-preview)

Why does this keep improving?

  • Better model → generates better solution candidates
  • Better candidates + better PRM → MCTS finds higher-quality solutions
  • Higher-quality solutions → better training data → even better model

Why does it eventually plateau?

  • Diminishing returns: the improvement per round decreases (26% → 10% → 7% → 5%)
  • The base model hits architectural limits (7B parameters can’t reason beyond a certain complexity)
  • The MCTS explores less of the space as more solutions are already found

Why This Works: Three Key Ingredients

1. MCTS is Efficient Exploration

Instead of generating N random solutions (Best-of-N), MCTS intelligently explores the solution space:

  • Uses PRM scores to focus on promising branches
  • Prunes obviously bad paths
  • Converges to good solutions with fewer total rollouts

For the same compute budget, MCTS finds higher-quality solutions than random sampling.

2. Python Verification is Automatic and Scalable

No human annotators needed. No subjective judgment. Just: does the code produce the right output?

This enables:

  • Unlimited scale (generate as many solutions as you want)
  • High fidelity (automatic execution is perfect, unlike humans who make mistakes)
  • No labelling cost (no human time, no wage expenses)

3. Self-Evolution Closes the Loop

The model learns from its own search:

  • Good solutions feed back as training data
  • The model improves, generating better solutions
  • A virtuous cycle forms

Each round is a small improvement, but across 4 rounds, they compound.


Program-of-Thought (PoT) vs. Chain-of-Thought (CoT)

Chain-of-Thought (CoT): Natural language reasoning

Let me count the multiples of 3 below 100.
They are 3, 6, 9, ..., 99.
This is an arithmetic sequence with first term 3 and common difference 3.
The number of terms is 99/3 = 33.

Program-of-Thought (PoT): Python code

# Count multiples of 3 below 100
count = len([x for x in range(1, 100) if x % 3 == 0])
answer = count

Why PoT is better for this task:

  • Verification: You can execute PoT and check the result. CoT requires manual reading and judgment.
  • Precision: Code is unambiguous. English reasoning can be hard to follow.
  • Scalability: Execution is automatic. Manual checking doesn’t scale.

rStar-Math primarily uses PoT, though intermediate reasoning traces are still valuable (the PRM scores steps, not just final answers).


The Virtuous Cycle (Simplified)

Weak Model (42%) 

MCTS + PRM₁ find good solutions

Train on solutions → Better Model (68%)

Better Model + PRM₂ find better solutions

Train on solutions → Even Better Model (78%)

Better Model + PRM₃ find even better solutions

Train on solutions → Best Model (90%)

Each iteration, both the model and the verifier improve, unlocking better search for the next iteration.


Comparison to Other Approaches

ApproachCostAccessibilityScalability
Train GPT-4 from scratch$100M+OpenAI onlyVery limited (few labs)
Use Best-of-N (Paper 23)Moderate (many attempts)Any labHits ceiling as base accuracy is low
Self-evolution (rStar-Math)Reasonable (~$1M)Any research labScales via bootstrapping

rStar-Math finds a sweet spot: moderate cost, accessible implementation, strong results through bootstrapping.


Key Assumptions

For this approach to work:

  1. Problem domain must be automatically verifiable (math via Python works; open-ended writing doesn’t)
  2. Solutions must be decomposable into steps (so PRM can score intermediate reasoning)
  3. Base model must be capable enough (7B is near the limit; 1B would struggle more)
  4. PRM initialization must be reasonable (you need some initial signal to start bootstrapping)

All four hold for math reasoning. Not all domains are so fortunate.