Prerequisites
Read first: Paper 23 (Test-Time Compute), which explains Best-of-N and the basic idea of test-time compute. Also: Upper Confidence Bound.
UCB Formula for MCTS Node Selection
In Monte Carlo Tree Search, at each step you must decide: which partial solution should I explore next?
The Upper Confidence Bound (UCB) formula balances two goals:
- Exploitation: Go back to branches that have worked well (high average reward)
- Exploration: Try branches that haven’t been visited much (might be better!)
The formula is:
$$\text{UCB}(v) = \frac{Q(v)}{N(v)} + C \sqrt{\frac{\ln(N(\text{parent}))}{N(v)}}$$
Where:
- $Q(v)$ = total reward accumulated at node $v$ (sum of all rollout rewards)
- $N(v)$ = number of times node $v$ has been visited
- $N(\text{parent})$ = number of times the parent node has been visited
- $C$ = exploration constant (typically $\sqrt{2} \approx 1.414$)
Interpretation:
- First term $\frac{Q(v)}{N(v)}$ = average reward at this node. Exploitation: favor high-reward branches
- Second term $C \sqrt{\frac{\ln(N(\text{parent}))}{N(v)}}$ = exploration bonus. Penalizes nodes with many visits, rewards nodes with few visits
Worked Example: UCB in Action
Scenario: A MCTS search for a math problem. The parent node has been visited $N(\text{parent}) = 30$ times. Three child nodes represent different solution strategies:
Node A (cautious approach):
- $Q(A) = 8.4, N(A) = 12$
- $\text{UCB}(A) = \frac{8.4}{12} + 1.414 \times \sqrt{\frac{\ln(30)}{12}}$
- $= 0.700 + 1.414 \times \sqrt{\frac{3.401}{12}}$
- $= 0.700 + 1.414 \times \sqrt{0.283}$
- $= 0.700 + 1.414 \times 0.532$
- $= 0.700 + 0.752 = \mathbf{1.452}$
Node B (moderate approach):
- $Q(B) = 2.1, N(B) = 3$
- $\text{UCB}(B) = \frac{2.1}{3} + 1.414 \times \sqrt{\frac{\ln(30)}{3}}$
- $= 0.700 + 1.414 \times \sqrt{\frac{3.401}{3}}$
- $= 0.700 + 1.414 \times \sqrt{1.134}$
- $= 0.700 + 1.414 \times 1.065$
- $= 0.700 + 1.507 = \mathbf{2.207}$
Node C (bold approach):
- $Q(C) = 0.5, N(C) = 1$
- $\text{UCB}(C) = \frac{0.5}{1} + 1.414 \times \sqrt{\frac{\ln(30)}{1}}$
- $= 0.500 + 1.414 \times \sqrt{3.401}$
- $= 0.500 + 1.414 \times 1.844$
- $= 0.500 + 2.608 = \mathbf{3.108}$
MCTS selects Node C (highest UCB = 3.108).
Why? Even though Node A has the highest average reward (8.4/12 = 0.70), it’s been heavily explored. Node C barely touched (1 visit) deserves more attention. The exploration bonus for C overwhelms its lower average reward.
Key insight: MCTS prevents getting stuck in local optima by forcing exploration of under-visited branches.
PRM Step Scoring
The Process Reward Model outputs a score $p_i \in [0, 1]$ for each solution step $i$.
Step-level accuracy: How likely is step $i$ to be on the path to the correct answer?
The overall solution score can be computed as:
Product score: $$\text{score}{\text{product}} = \prod{i=1}^{n} p_i$$
This penalizes heavily if any step is weak. Example: if steps are [0.90, 0.85, 0.92], product score = 0.706.
Minimum score: $$\text{score}_{\text{min}} = \min_i p_i$$
This identifies the weakest link. Example: min([0.90, 0.85, 0.92]) = 0.85.
rStar-Math uses these scores to filter solutions in the MCTS rollout:
- Prune branches where $\text{score}_{\text{min}} < 0.5$ (weakest step is unreliable)
- Expand branches where $\text{score}_{\text{min}} > 0.8$ (all steps look good)
Round-by-Round Improvement Analysis
The paper reports accuracy across 4 rounds:
| Round | Accuracy | Improvement | Improvement % |
|---|---|---|---|
| 0 (base) | 42.0% | — | — |
| 1 | 68.4% | +26.4 pp | +62.9% relative |
| 2 | 78.2% | +9.8 pp | +14.3% relative |
| 3 | 85.3% | +7.1 pp | +9.1% relative |
| 4 | 90.0% | +4.7 pp | +5.5% relative |
Observation: Diminishing Returns
The improvement per round is decreasing: 26.4 → 9.8 → 7.1 → 4.7. This is characteristic of a bootstrapping process hitting saturation.
Analysis:
If we model the ceiling at ~92% (frontier performance) and assume geometric decay, the remaining improvement is:
- Round 0 remaining: 92% - 42% = 50 pp
- Round 1 remaining: 92% - 68.4% = 23.6 pp (reduction factor ≈ 0.47)
- Round 2 remaining: 92% - 78.2% = 13.8 pp (reduction factor ≈ 0.58)
- Round 3 remaining: 92% - 85.3% = 6.7 pp (reduction factor ≈ 0.49)
- Round 4 remaining: 92% - 90% = 2 pp
The reduction factor is roughly constant (~0.5), suggesting an exponential decay:
$$\text{Accuracy}_{\text{round } n} = 92 - 50 \times 0.5^n$$
Verify:
- Round 0: 92 - 50 = 42 ✓
- Round 1: 92 - 25 = 67 (actual: 68.4, close!)
- Round 2: 92 - 12.5 = 79.5 (actual: 78.2, close!)
- Round 4: 92 - 3.125 = 88.875 (actual: 90, reasonable)
Why diminishing returns?
- Early rounds: the model and PRM both improve dramatically
- Late rounds: MCTS has already found most of the good solutions in the dataset; further search yields redundant data
- Model capacity: a 7B model can’t reason perfectly; there’s a ceiling
Supervised Fine-Tuning (SFT) Loss
For each solution collected in the MCTS phase, we fine-tune the model using standard cross-entropy loss:
$$L_{\text{SFT}} = -\frac{1}{T} \sum_{t=1}^{T} \log P(\text{token}t | \text{tokens}{<t}, \text{problem})$$
Where:
- $T$ = total number of tokens in the solution
- $\text{token}_t$ = the $t$-th token in the correct solution
- $P(\text{token}_t | \ldots)$ = probability the model assigns to that token
Key: We only train on tokens from solutions that:
- Executed correctly (Python verification passed)
- Had high PRM scores (good reasoning quality)
This biases the model toward high-quality solutions.
Comparison to Best-of-N (Paper 23)
Recall from Paper 23: $P(\text{success with } N \text{ samples}) = 1 - (1-p)^N$
Best-of-N approach:
- Generate N=10 solutions from 42%-accurate base model
- Expected successes in N samples: $1 - 0.58^{10} \approx 99.8%$… sounds great!
- But: need a reliable PRM to pick the right one; many solutions are low-quality reasoning
MCTS approach:
- Generate fewer total samples, but guided by MCTS
- All selected solutions have high reasoning quality
- Training on these solutions directly improves the base model
- Result: model becomes 68% accurate on its own (doesn’t need test-time compute anymore)
rStar-Math is more ambitious: it’s not content to just improve inference accuracy; it improves the base model itself.
Why Four Rounds?
Why not 10 rounds? Or 2?
The paper chose 4 because:
- Round 1: Biggest improvements (data quality jump is largest)
- Rounds 2–3: Good improvements with reasonable compute
- Round 4: Hits diminishing returns; Round 5 would be minimal additional gain
- Computational budget: Each round requires running MCTS on 12,500 problems, which is expensive
A practical tradeoff: 4 rounds gives 90% accuracy with reasonable resources.
Parameter Counts and Compute
- Qwen-7B: 7 billion parameters
- PRM training: ~100–200 million parameters (much smaller)
- MCTS compute: Running MCTS search on 12,500 problems is the main bottleneck
Total training time: Estimated 1–2 GPU-months of compute per round, so 4–8 GPU-months total. For comparison, training a model from scratch is 100s of GPU-months.
Convergence and Stopping
The self-evolution loop stops when:
- Accuracy plateaus: Improvement per round falls below a threshold
- Compute budget exhausted: Resources run out
- PRM quality degrades: If PRM becomes less reliable, it might select worse solutions, and performance drops
In practice, 4 rounds is empirically optimal for the MATH dataset.