The Four-Round Loop
rStar-Math works through four consecutive rounds, each following the same pattern:
- Search: Use MCTS guided by a PRM to generate high-quality solutions
- Verify: Run solutions as Python code — automatic, no human judgment
- Select: Keep only correct solutions with high PRM scores
- Train: Supervised fine-tune the model on this selected data
- Improve: Also train a better PRM for the next round
- Repeat: The improved model generates better data, which trains an even better model
Let’s walk through each round carefully.
Round 1: Starting from Weakness
Input: Base Qwen-7B model, ~42% accuracy on MATH.
MCTS Search: For each problem in the MATH dataset:
- Initialize an empty solution (root node of the search tree)
- Repeatedly:
- Selection: Use UCB to pick which partial solution to expand next
- Promising branches (high PRM score, few visits) get explored
- Poor branches (low PRM score, many visits) are abandoned
- Expansion: Ask the model: “What’s the next reasoning step?”
- Simulation: Complete the solution (either through rollout or direct execution)
- Evaluation: Run the Python code. Did we get the right answer? (+1 reward) or wrong? (0 reward)
- Backup: Update all nodes on the path with the outcome
- Selection: Use UCB to pick which partial solution to expand next
- Repeat until convergence (the tree explores many branches and settles on good ones)
Example: For a divisibility problem, the MCTS tree explores:
- Branch A: “Count multiples of 3 using range(3, 100, 3)”
- Branch B: “Count using list comprehension”
- Branch C: “Use the formula floor(99/3)”
The PRM scores each, and MCTS focuses on the most promising branches.
Python Verification: Once MCTS finds a candidate solution (a complete Python program), run it:
exec(solution_code)
if answer == 27: # correct answer
label = "CORRECT"
No human judgment. Just execution.
Data Selection: Collect all solutions where:
- Python execution returned the correct answer
- PRM scores on intermediate steps were high (model reasoned well)
This gives us, say, 8,000 high-quality solutions (out of 12,500 problems).
Supervised Fine-Tuning: Train Qwen-7B on these 8,000 solutions: $$L_{\text{SFT}} = -\frac{1}{T} \sum_{t} \log P(\text{token}t | \text{tokens}{<t})$$
Only the tokens from correct, high-quality solutions contribute to the loss.
PRM₂ Training: Also train a new PRM₂ on step-level annotations from the MCTS data. PRM₁ was basic; PRM₂ is better because trained on more and better data.
Result: ~68% accuracy on MATH. A 26 percentage point jump from 42%!
Round 2: Bootstrapping Improvement
Input: Improved Qwen-7B (now 68% accurate), PRM₂ (better at scoring steps).
The Loop Repeats: Run MCTS again, but now:
- The model generates better candidate solutions (because it’s better)
- The PRM guides the search more accurately (because it’s better)
- Together, they find even higher-quality solutions
Result: ~78% accuracy. +10 percentage points.
Round 3 & 4: Cascading Gains
The same pattern continues:
- Round 3: 78% → 85%
- Round 4: 85% → 90% (matches o1-preview)
Why does this keep improving?
- Better model → generates better solution candidates
- Better candidates + better PRM → MCTS finds higher-quality solutions
- Higher-quality solutions → better training data → even better model
Why does it eventually plateau?
- Diminishing returns: the improvement per round decreases (26% → 10% → 7% → 5%)
- The base model hits architectural limits (7B parameters can’t reason beyond a certain complexity)
- The MCTS explores less of the space as more solutions are already found
Why This Works: Three Key Ingredients
1. MCTS is Efficient Exploration
Instead of generating N random solutions (Best-of-N), MCTS intelligently explores the solution space:
- Uses PRM scores to focus on promising branches
- Prunes obviously bad paths
- Converges to good solutions with fewer total rollouts
For the same compute budget, MCTS finds higher-quality solutions than random sampling.
2. Python Verification is Automatic and Scalable
No human annotators needed. No subjective judgment. Just: does the code produce the right output?
This enables:
- Unlimited scale (generate as many solutions as you want)
- High fidelity (automatic execution is perfect, unlike humans who make mistakes)
- No labelling cost (no human time, no wage expenses)
3. Self-Evolution Closes the Loop
The model learns from its own search:
- Good solutions feed back as training data
- The model improves, generating better solutions
- A virtuous cycle forms
Each round is a small improvement, but across 4 rounds, they compound.
Program-of-Thought (PoT) vs. Chain-of-Thought (CoT)
Chain-of-Thought (CoT): Natural language reasoning
Let me count the multiples of 3 below 100.
They are 3, 6, 9, ..., 99.
This is an arithmetic sequence with first term 3 and common difference 3.
The number of terms is 99/3 = 33.
Program-of-Thought (PoT): Python code
# Count multiples of 3 below 100
count = len([x for x in range(1, 100) if x % 3 == 0])
answer = count
Why PoT is better for this task:
- Verification: You can execute PoT and check the result. CoT requires manual reading and judgment.
- Precision: Code is unambiguous. English reasoning can be hard to follow.
- Scalability: Execution is automatic. Manual checking doesn’t scale.
rStar-Math primarily uses PoT, though intermediate reasoning traces are still valuable (the PRM scores steps, not just final answers).
The Virtuous Cycle (Simplified)
Weak Model (42%)
↓
MCTS + PRM₁ find good solutions
↓
Train on solutions → Better Model (68%)
↓
Better Model + PRM₂ find better solutions
↓
Train on solutions → Even Better Model (78%)
↓
Better Model + PRM₃ find even better solutions
↓
Train on solutions → Best Model (90%)
Each iteration, both the model and the verifier improve, unlocking better search for the next iteration.
Comparison to Other Approaches
| Approach | Cost | Accessibility | Scalability |
|---|---|---|---|
| Train GPT-4 from scratch | $100M+ | OpenAI only | Very limited (few labs) |
| Use Best-of-N (Paper 23) | Moderate (many attempts) | Any lab | Hits ceiling as base accuracy is low |
| Self-evolution (rStar-Math) | Reasonable (~$1M) | Any research lab | Scales via bootstrapping |
rStar-Math finds a sweet spot: moderate cost, accessible implementation, strong results through bootstrapping.
Key Assumptions
For this approach to work:
- Problem domain must be automatically verifiable (math via Python works; open-ended writing doesn’t)
- Solutions must be decomposable into steps (so PRM can score intermediate reasoning)
- Base model must be capable enough (7B is near the limit; 1B would struggle more)
- PRM initialization must be reasonable (you need some initial signal to start bootstrapping)
All four hold for math reasoning. Not all domains are so fortunate.