The Idea: Self-Evolution Through Search and Training — rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking

The Four-Round Loop

rStar-Math works through four consecutive rounds, each following the same pattern:

Search: Use MCTS guided by a PRM to generate high-quality solutions
Verify: Run solutions as Python code — automatic, no human judgment
Select: Keep only correct solutions with high PRM scores
Train: Supervised fine-tune the model on this selected data
Improve: Also train a better PRM for the next round
Repeat: The improved model generates better data, which trains an even better model

Let’s walk through each round carefully.

Round 1: Starting from Weakness

Input: Base Qwen-7B model, ~42% accuracy on MATH.

MCTS Search: For each problem in the MATH dataset:

Initialize an empty solution (root node of the search tree)
Repeatedly:
- Selection: Use UCB to pick which partial solution to expand next
  - Promising branches (high PRM score, few visits) get explored
  - Poor branches (low PRM score, many visits) are abandoned
- Expansion: Ask the model: “What’s the next reasoning step?”
- Simulation: Complete the solution (either through rollout or direct execution)
- Evaluation: Run the Python code. Did we get the right answer? (+1 reward) or wrong? (0 reward)
- Backup: Update all nodes on the path with the outcome
Repeat until convergence (the tree explores many branches and settles on good ones)

Example: For a divisibility problem, the MCTS tree explores:

Branch A: “Count multiples of 3 using range(3, 100, 3)”
Branch B: “Count using list comprehension”
Branch C: “Use the formula floor(99/3)”

The PRM scores each, and MCTS focuses on the most promising branches.

Python Verification: Once MCTS finds a candidate solution (a complete Python program), run it:

exec(solution_code)
if answer == 27:  # correct answer
    label = "CORRECT"

No human judgment. Just execution.

Data Selection: Collect all solutions where:

Python execution returned the correct answer
PRM scores on intermediate steps were high (model reasoned well)

This gives us, say, 8,000 high-quality solutions (out of 12,500 problems).

Supervised Fine-Tuning: Train Qwen-7B on these 8,000 solutions: $$L_{\text{SFT}} = -\frac{1}{T} \sum_{t} \log P(\text{token}t | \text{tokens}{<t})$$

Only the tokens from correct, high-quality solutions contribute to the loss.

PRM₂ Training: Also train a new PRM₂ on step-level annotations from the MCTS data. PRM₁ was basic; PRM₂ is better because trained on more and better data.

Result: ~68% accuracy on MATH. A 26 percentage point jump from 42%!

Round 2: Bootstrapping Improvement

Input: Improved Qwen-7B (now 68% accurate), PRM₂ (better at scoring steps).

The Loop Repeats: Run MCTS again, but now:

The model generates better candidate solutions (because it’s better)
The PRM guides the search more accurately (because it’s better)
Together, they find even higher-quality solutions

Result: ~78% accuracy. +10 percentage points.

Round 3 & 4: Cascading Gains

The same pattern continues:

Round 3: 78% → 85%
Round 4: 85% → 90% (matches o1-preview)

Why does this keep improving?

Better model → generates better solution candidates
Better candidates + better PRM → MCTS finds higher-quality solutions
Higher-quality solutions → better training data → even better model

Why does it eventually plateau?

Diminishing returns: the improvement per round decreases (26% → 10% → 7% → 5%)
The base model hits architectural limits (7B parameters can’t reason beyond a certain complexity)
The MCTS explores less of the space as more solutions are already found

Why This Works: Three Key Ingredients

1. MCTS is Efficient Exploration

Instead of generating N random solutions (Best-of-N), MCTS intelligently explores the solution space:

Uses PRM scores to focus on promising branches
Prunes obviously bad paths
Converges to good solutions with fewer total rollouts

For the same compute budget, MCTS finds higher-quality solutions than random sampling.

2. Python Verification is Automatic and Scalable

No human annotators needed. No subjective judgment. Just: does the code produce the right output?

This enables:

Unlimited scale (generate as many solutions as you want)
High fidelity (automatic execution is perfect, unlike humans who make mistakes)
No labelling cost (no human time, no wage expenses)

3. Self-Evolution Closes the Loop

The model learns from its own search:

Good solutions feed back as training data
The model improves, generating better solutions
A virtuous cycle forms

Each round is a small improvement, but across 4 rounds, they compound.

Program-of-Thought (PoT) vs. Chain-of-Thought (CoT)

Chain-of-Thought (CoT): Natural language reasoning

Let me count the multiples of 3 below 100.
They are 3, 6, 9, ..., 99.
This is an arithmetic sequence with first term 3 and common difference 3.
The number of terms is 99/3 = 33.

Program-of-Thought (PoT): Python code

# Count multiples of 3 below 100
count = len([x for x in range(1, 100) if x % 3 == 0])
answer = count

Why PoT is better for this task:

Verification: You can execute PoT and check the result. CoT requires manual reading and judgment.
Precision: Code is unambiguous. English reasoning can be hard to follow.
Scalability: Execution is automatic. Manual checking doesn’t scale.

rStar-Math primarily uses PoT, though intermediate reasoning traces are still valuable (the PRM scores steps, not just final answers).

The Virtuous Cycle (Simplified)

Weak Model (42%) 
    ↓
MCTS + PRM₁ find good solutions
    ↓
Train on solutions → Better Model (68%)
    ↓
Better Model + PRM₂ find better solutions
    ↓
Train on solutions → Even Better Model (78%)
    ↓
Better Model + PRM₃ find even better solutions
    ↓
Train on solutions → Best Model (90%)

Each iteration, both the model and the verifier improve, unlocking better search for the next iteration.

Comparison to Other Approaches

Approach	Cost	Accessibility	Scalability
Train GPT-4 from scratch	$100M+	OpenAI only	Very limited (few labs)
Use Best-of-N (Paper 23)	Moderate (many attempts)	Any lab	Hits ceiling as base accuracy is low
Self-evolution (rStar-Math)	Reasonable (~$1M)	Any research lab	Scales via bootstrapping

rStar-Math finds a sweet spot: moderate cost, accessible implementation, strong results through bootstrapping.

Key Assumptions

For this approach to work:

Problem domain must be automatically verifiable (math via Python works; open-ended writing doesn’t)
Solutions must be decomposable into steps (so PRM can score intermediate reasoning)
Base model must be capable enough (7B is near the limit; 1B would struggle more)
PRM initialization must be reasonable (you need some initial signal to start bootstrapping)

All four hold for math reasoning. Not all domains are so fortunate.