The Mathematics Behind MCTS and Self-Evolution — rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking

Prerequisites

Read first: Paper 23 (Test-Time Compute), which explains Best-of-N and the basic idea of test-time compute. Also: Upper Confidence Bound.

UCB Formula for MCTS Node Selection

In Monte Carlo Tree Search, at each step you must decide: which partial solution should I explore next?

The Upper Confidence Bound (UCB) formula balances two goals:

Exploitation: Go back to branches that have worked well (high average reward)
Exploration: Try branches that haven’t been visited much (might be better!)

The formula is:

$$\text{UCB}(v) = \frac{Q(v)}{N(v)} + C \sqrt{\frac{\ln(N(\text{parent}))}{N(v)}}$$

Where:

$Q(v)$ = total reward accumulated at node $v$ (sum of all rollout rewards)
$N(v)$ = number of times node $v$ has been visited
$N(\text{parent})$ = number of times the parent node has been visited
$C$ = exploration constant (typically $\sqrt{2} \approx 1.414$)

Interpretation:

First term $\frac{Q(v)}{N(v)}$ = average reward at this node. Exploitation: favor high-reward branches
Second term $C \sqrt{\frac{\ln(N(\text{parent}))}{N(v)}}$ = exploration bonus. Penalizes nodes with many visits, rewards nodes with few visits

Worked Example: UCB in Action

Scenario: A MCTS search for a math problem. The parent node has been visited $N(\text{parent}) = 30$ times. Three child nodes represent different solution strategies:

Node A (cautious approach):

$Q(A) = 8.4, N(A) = 12$
$\text{UCB}(A) = \frac{8.4}{12} + 1.414 \times \sqrt{\frac{\ln(30)}{12}}$
$= 0.700 + 1.414 \times \sqrt{\frac{3.401}{12}}$
$= 0.700 + 1.414 \times \sqrt{0.283}$
$= 0.700 + 1.414 \times 0.532$
$= 0.700 + 0.752 = \mathbf{1.452}$

Node B (moderate approach):

$Q(B) = 2.1, N(B) = 3$
$\text{UCB}(B) = \frac{2.1}{3} + 1.414 \times \sqrt{\frac{\ln(30)}{3}}$
$= 0.700 + 1.414 \times \sqrt{\frac{3.401}{3}}$
$= 0.700 + 1.414 \times \sqrt{1.134}$
$= 0.700 + 1.414 \times 1.065$
$= 0.700 + 1.507 = \mathbf{2.207}$

Node C (bold approach):

$Q(C) = 0.5, N(C) = 1$
$\text{UCB}(C) = \frac{0.5}{1} + 1.414 \times \sqrt{\frac{\ln(30)}{1}}$
$= 0.500 + 1.414 \times \sqrt{3.401}$
$= 0.500 + 1.414 \times 1.844$
$= 0.500 + 2.608 = \mathbf{3.108}$

MCTS selects Node C (highest UCB = 3.108).

Why? Even though Node A has the highest average reward (8.4/12 = 0.70), it’s been heavily explored. Node C barely touched (1 visit) deserves more attention. The exploration bonus for C overwhelms its lower average reward.

Key insight: MCTS prevents getting stuck in local optima by forcing exploration of under-visited branches.

PRM Step Scoring

The Process Reward Model outputs a score $p_i \in [0, 1]$ for each solution step $i$.

Step-level accuracy: How likely is step $i$ to be on the path to the correct answer?

The overall solution score can be computed as:

Product score: $$\text{score}{\text{product}} = \prod{i=1}^{n} p_i$$

This penalizes heavily if any step is weak. Example: if steps are [0.90, 0.85, 0.92], product score = 0.706.

Minimum score: $$\text{score}_{\text{min}} = \min_i p_i$$

This identifies the weakest link. Example: min([0.90, 0.85, 0.92]) = 0.85.

rStar-Math uses these scores to filter solutions in the MCTS rollout:

Prune branches where $\text{score}_{\text{min}} < 0.5$ (weakest step is unreliable)
Expand branches where $\text{score}_{\text{min}} > 0.8$ (all steps look good)

Round-by-Round Improvement Analysis

The paper reports accuracy across 4 rounds:

Round	Accuracy	Improvement	Improvement %
0 (base)	42.0%	—	—
1	68.4%	+26.4 pp	+62.9% relative
2	78.2%	+9.8 pp	+14.3% relative
3	85.3%	+7.1 pp	+9.1% relative
4	90.0%	+4.7 pp	+5.5% relative

Observation: Diminishing Returns

The improvement per round is decreasing: 26.4 → 9.8 → 7.1 → 4.7. This is characteristic of a bootstrapping process hitting saturation.

Analysis:

If we model the ceiling at ~92% (frontier performance) and assume geometric decay, the remaining improvement is:

Round 0 remaining: 92% - 42% = 50 pp
Round 1 remaining: 92% - 68.4% = 23.6 pp (reduction factor ≈ 0.47)
Round 2 remaining: 92% - 78.2% = 13.8 pp (reduction factor ≈ 0.58)
Round 3 remaining: 92% - 85.3% = 6.7 pp (reduction factor ≈ 0.49)
Round 4 remaining: 92% - 90% = 2 pp

The reduction factor is roughly constant (~0.5), suggesting an exponential decay:

$$\text{Accuracy}_{\text{round } n} = 92 - 50 \times 0.5^n$$

Verify:

Round 0: 92 - 50 = 42 ✓
Round 1: 92 - 25 = 67 (actual: 68.4, close!)
Round 2: 92 - 12.5 = 79.5 (actual: 78.2, close!)
Round 4: 92 - 3.125 = 88.875 (actual: 90, reasonable)

Why diminishing returns?

Early rounds: the model and PRM both improve dramatically
Late rounds: MCTS has already found most of the good solutions in the dataset; further search yields redundant data
Model capacity: a 7B model can’t reason perfectly; there’s a ceiling

Supervised Fine-Tuning (SFT) Loss

For each solution collected in the MCTS phase, we fine-tune the model using standard cross-entropy loss:

$$L_{\text{SFT}} = -\frac{1}{T} \sum_{t=1}^{T} \log P(\text{token}t | \text{tokens}{<t}, \text{problem})$$

Where:

$T$ = total number of tokens in the solution
$\text{token}_t$ = the $t$-th token in the correct solution
$P(\text{token}_t | \ldots)$ = probability the model assigns to that token

Key: We only train on tokens from solutions that:

Executed correctly (Python verification passed)
Had high PRM scores (good reasoning quality)

This biases the model toward high-quality solutions.

Comparison to Best-of-N (Paper 23)

Recall from Paper 23: $P(\text{success with } N \text{ samples}) = 1 - (1-p)^N$

Best-of-N approach:

Generate N=10 solutions from 42%-accurate base model
Expected successes in N samples: $1 - 0.58^{10} \approx 99.8%$… sounds great!
But: need a reliable PRM to pick the right one; many solutions are low-quality reasoning

MCTS approach:

Generate fewer total samples, but guided by MCTS
All selected solutions have high reasoning quality
Training on these solutions directly improves the base model
Result: model becomes 68% accurate on its own (doesn’t need test-time compute anymore)

rStar-Math is more ambitious: it’s not content to just improve inference accuracy; it improves the base model itself.

Why Four Rounds?

Why not 10 rounds? Or 2?

The paper chose 4 because:

Round 1: Biggest improvements (data quality jump is largest)
Rounds 2–3: Good improvements with reasonable compute
Round 4: Hits diminishing returns; Round 5 would be minimal additional gain
Computational budget: Each round requires running MCTS on 12,500 problems, which is expensive

A practical tradeoff: 4 rounds gives 90% accuracy with reasonable resources.

Parameter Counts and Compute

Qwen-7B: 7 billion parameters
PRM training: ~100–200 million parameters (much smaller)
MCTS compute: Running MCTS search on 12,500 problems is the main bottleneck

Total training time: Estimated 1–2 GPU-months of compute per round, so 4–8 GPU-months total. For comparison, training a model from scratch is 100s of GPU-months.

Convergence and Stopping

The self-evolution loop stops when:

Accuracy plateaus: Improvement per round falls below a threshold
Compute budget exhausted: Resources run out
PRM quality degrades: If PRM becomes less reliable, it might select worse solutions, and performance drops

In practice, 4 rounds is empirically optimal for the MATH dataset.