Worked Example: One Complete MCTS Rollout — rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking

Let’s walk through one complete MCTS search for a specific math problem. This will show exactly how MCTS + PRM + Python verification work together.

The Problem

“How many positive integers less than 100 are divisible by 3 but not by 5?”

Expected answer: 27

(Explanation: Divisible by 3 below 100 = 33 numbers. Divisible by both 3 and 5 (i.e., by 15) = 6 numbers. Inclusion-exclusion: 33 - 6 = 27.)

Round 1 of MCTS Search

The MCTS tree is initially empty (just the root).

Step 1: Initialization (Root Node)

Node state: “Problem: count integers 1-99 divisible by 3 but not by 5”

The root has no children yet. MCTS will generate candidate next steps.

Step 2: Selection (First Expansion)

The model (Qwen-7B) generates three candidate first steps:

Branch A: Mathematical reasoning in natural language

“Multiples of 3 below 100: 3, 6, 9, …, 99. This is 99/3 = 33 numbers.”
State: A₁ (after this step, we’ve counted divisible-by-3)

Branch B: Direct Python calculation

“count_div3 = len([x for x in range(1, 100) if x % 3 == 0])”
State: B₁ (after this step, count_div3 is computed)

Branch C: Formula-based approach

“Use the formula for arithmetic sequences: a_n = a_1 + (n-1)d. Here a_1=3, d=3, a_n=99. Solve 99 = 3 + (n-1)×3 → n = 33.”
State: C₁ (after this step, we know the count is 33)

Step 3: PRM Evaluation of Candidates

PRM₁ (initially trained) scores each candidate:

Branch	Step Content	PRM Score	Reason
A	Natural language reasoning	0.75	Correct logic, but harder to verify
B	Python code	0.95	Unambiguous, auto-verifiable
C	Algebra	0.82	Correct, but more complex

MCTS selects Branch B (highest UCB, because it has few visits and high PRM score).

Step 4: Expand Branch B

Current solution trace:

count_div3 = len([x for x in range(1, 100) if x % 3 == 0])

The model generates the next logical step:

Candidate next steps:

“Now count multiples of 15 (divisible by both 3 and 5): count_div15 = len([x for x in range(1, 100) if x % 15 == 0])”
“Subtract immediately: answer = count_div3 - 6” (incorrect — we haven’t counted the 6 yet)
“Verify: run the code so far and print count_div3”

PRM scores:

Candidate 1: 0.92 (correct, logical next step)
Candidate 2: 0.40 (wrong — jumps to answer without intermediate step)
Candidate 3: 0.88 (correct but maybe redundant)

MCTS expands Candidate 1 (B→B1).

New solution state (B₁):

count_div3 = len([x for x in range(1, 100) if x % 3 == 0])
count_div15 = len([x for x in range(1, 100) if x % 15 == 0])

Step 5: Continue from B₁

Model generates the next step:

Candidate next steps:

“answer = count_div3 - count_div15” (correct final step)
“Check: are there any numbers divisible by 5 but not 3? We don’t need to subtract those.” (confused reasoning)

PRM scores:

Candidate 1: 0.91 (correct)
Candidate 2: 0.35 (confused logic)

MCTS expands Candidate 1 (B₁→B₂).

Final solution state (B₂):

count_div3 = len([x for x in range(1, 100) if x % 3 == 0])
count_div15 = len([x for x in range(1, 100) if x % 15 == 0])
answer = count_div3 - count_div15

Step 6: Verification (Leaf Node Evaluation)

Now we have a complete solution. Run the code:

count_div3 = len([x for x in range(1, 100) if x % 3 == 0])
count_div15 = len([x for x in range(1, 100) if x % 15 == 0])
answer = count_div3 - count_div15

Execution:

count_div3 = len([3, 6, 9, ..., 99]) = 33
count_div15 = len([15, 30, 45, 60, 75, 90]) = 6
answer = 33 - 6 = 27

Expected answer: 27. ✓ CORRECT!

Reward: +1 (correct solution with high-quality reasoning)

Step 7: Backup (Propagate Reward Up the Tree)

Update all nodes on the path with the reward:

Root (42% base accuracy)
  └─ B₁ "count_div3..." [PRM: 0.95]
      └─ B₁ → B₂ "count_div15..." [PRM: 0.92]
          └─ B₂ → answer... [PRM: 0.91]
              ✓ Rollout success: reward = 1

B₂ → answer: visits = 1, Q = 1
B₁ → B₂: visits = 1, Q = 1
Root → B: visits = 1, Q = 1

What the Training Data Looks Like

This MCTS rollout produces one training example:

Input (problem description): “How many positive integers less than 100 are divisible by 3 but not by 5?”

Output (solution trace, step-by-step):

# Step 1: Count multiples of 3
count_div3 = len([x for x in range(1, 100) if x % 3 == 0])

# Step 2: Count multiples of 15 (both 3 and 5)
count_div15 = len([x for x in range(1, 100) if x % 15 == 0])

# Step 3: Use inclusion-exclusion
answer = count_div3 - count_div15

Step-level PRM annotations:

Step 1: [0.95] (very confident this is correct)
Step 2: [0.92] (confident this is correct)
Step 3: [0.91] (confident this is correct)

Final verification:

Execution result: answer = 27
Expected result: 27
Label: CORRECT ✓

What Happens After Multiple Rollouts?

In a real MCTS search, you don’t stop after one rollout. You continue:

Rollout 2: Same problem, but maybe the search explores Branch A or C this time (because B has been visited once, exploration bonus sends MCTS to less-explored branches).

Rollout 3-N: Keep exploring different paths. Eventually, MCTS converges:

Branch B consistently leads to correct answers
Branches A and C also work but are less efficient
The search tree settles on B as the best approach

After ~100 rollouts, MCTS has thoroughly explored the solution space and identified the highest-quality reasoning path(s).

Rounds 2–4: How the Model Improves

After Round 1:

We collect ~8,000 solutions like the one above from MCTS across all 12,500 problems
All solutions are CORRECT (verified by Python execution)
All have high PRM scores (good reasoning quality)
Train Qwen-7B on these 8,000 solutions → 68% accuracy

Now, in Round 2:

Run MCTS again, but the model is better (68% vs. 42%)
It generates better candidate steps (because it’s learned from Round 1 data)
MCTS finds even higher-quality solutions
These solutions train the model further → 78% accuracy

The cycle repeats.

Why This Example Works

For this specific problem:

Python verification is perfect: We can run the code and check the answer with 100% certainty
The solution is step-wise compositional: Each step builds on the previous one, and PRM can score each step
The model can learn the pattern: By Round 2, the model has seen similar divisibility problems and learned the inclusion-exclusion approach

Not all problems have these properties. Open-ended problems (write an essay, design a system) don’t have automatic verification, so the approach breaks down. That’s a key limitation (Section 7).

Verification by Hand

Let’s verify the answer manually to confirm:

Multiples of 3 below 100: 3, 6, 9, …, 99. Count = ⌊99/3⌋ = 33. ✓
Multiples of 15 below 100: 15, 30, 45, 60, 75, 90. Count = ⌊99/15⌋ = 6. ✓
Divisible by 3 but NOT by 5: 33 - 6 = 27. ✓

By hand: 3, 6, 9, 12, 18, 21, 24, 27, 33, 36, 39, 42, 48, 51, 54, 57, 63, 66, 69, 72, 78, 81, 84, 87, 93, 96, 99 = 27 ✓