Section 05

Worked Example: One Complete MCTS Rollout

rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking 2025

Let’s walk through one complete MCTS search for a specific math problem. This will show exactly how MCTS + PRM + Python verification work together.


The Problem

“How many positive integers less than 100 are divisible by 3 but not by 5?”

Expected answer: 27

(Explanation: Divisible by 3 below 100 = 33 numbers. Divisible by both 3 and 5 (i.e., by 15) = 6 numbers. Inclusion-exclusion: 33 - 6 = 27.)


The MCTS tree is initially empty (just the root).

Step 1: Initialization (Root Node)

Node state: “Problem: count integers 1-99 divisible by 3 but not by 5”

The root has no children yet. MCTS will generate candidate next steps.


Step 2: Selection (First Expansion)

The model (Qwen-7B) generates three candidate first steps:

Branch A: Mathematical reasoning in natural language

  • “Multiples of 3 below 100: 3, 6, 9, …, 99. This is 99/3 = 33 numbers.”
  • State: A₁ (after this step, we’ve counted divisible-by-3)

Branch B: Direct Python calculation

  • “count_div3 = len([x for x in range(1, 100) if x % 3 == 0])”
  • State: B₁ (after this step, count_div3 is computed)

Branch C: Formula-based approach

  • “Use the formula for arithmetic sequences: a_n = a_1 + (n-1)d. Here a_1=3, d=3, a_n=99. Solve 99 = 3 + (n-1)×3 → n = 33.”
  • State: C₁ (after this step, we know the count is 33)

Step 3: PRM Evaluation of Candidates

PRM₁ (initially trained) scores each candidate:

BranchStep ContentPRM ScoreReason
ANatural language reasoning0.75Correct logic, but harder to verify
BPython code0.95Unambiguous, auto-verifiable
CAlgebra0.82Correct, but more complex

MCTS selects Branch B (highest UCB, because it has few visits and high PRM score).


Step 4: Expand Branch B

Current solution trace:

count_div3 = len([x for x in range(1, 100) if x % 3 == 0])

The model generates the next logical step:

Candidate next steps:

  1. “Now count multiples of 15 (divisible by both 3 and 5): count_div15 = len([x for x in range(1, 100) if x % 15 == 0])”
  2. “Subtract immediately: answer = count_div3 - 6” (incorrect — we haven’t counted the 6 yet)
  3. “Verify: run the code so far and print count_div3”

PRM scores:

  • Candidate 1: 0.92 (correct, logical next step)
  • Candidate 2: 0.40 (wrong — jumps to answer without intermediate step)
  • Candidate 3: 0.88 (correct but maybe redundant)

MCTS expands Candidate 1 (B→B1).

New solution state (B₁):

count_div3 = len([x for x in range(1, 100) if x % 3 == 0])
count_div15 = len([x for x in range(1, 100) if x % 15 == 0])

Step 5: Continue from B₁

Model generates the next step:

Candidate next steps:

  1. “answer = count_div3 - count_div15” (correct final step)
  2. “Check: are there any numbers divisible by 5 but not 3? We don’t need to subtract those.” (confused reasoning)

PRM scores:

  • Candidate 1: 0.91 (correct)
  • Candidate 2: 0.35 (confused logic)

MCTS expands Candidate 1 (B₁→B₂).

Final solution state (B₂):

count_div3 = len([x for x in range(1, 100) if x % 3 == 0])
count_div15 = len([x for x in range(1, 100) if x % 15 == 0])
answer = count_div3 - count_div15

Step 6: Verification (Leaf Node Evaluation)

Now we have a complete solution. Run the code:

count_div3 = len([x for x in range(1, 100) if x % 3 == 0])
count_div15 = len([x for x in range(1, 100) if x % 15 == 0])
answer = count_div3 - count_div15

Execution:

  • count_div3 = len([3, 6, 9, ..., 99]) = 33
  • count_div15 = len([15, 30, 45, 60, 75, 90]) = 6
  • answer = 33 - 6 = 27

Expected answer: 27. ✓ CORRECT!

Reward: +1 (correct solution with high-quality reasoning)


Step 7: Backup (Propagate Reward Up the Tree)

Update all nodes on the path with the reward:

Root (42% base accuracy)
  └─ B₁ "count_div3..." [PRM: 0.95]
      └─ B₁ → B₂ "count_div15..." [PRM: 0.92]
          └─ B₂ → answer... [PRM: 0.91]
              ✓ Rollout success: reward = 1
  • B₂ → answer: visits = 1, Q = 1
  • B₁ → B₂: visits = 1, Q = 1
  • Root → B: visits = 1, Q = 1

What the Training Data Looks Like

This MCTS rollout produces one training example:

Input (problem description): “How many positive integers less than 100 are divisible by 3 but not by 5?”

Output (solution trace, step-by-step):

# Step 1: Count multiples of 3
count_div3 = len([x for x in range(1, 100) if x % 3 == 0])

# Step 2: Count multiples of 15 (both 3 and 5)
count_div15 = len([x for x in range(1, 100) if x % 15 == 0])

# Step 3: Use inclusion-exclusion
answer = count_div3 - count_div15

Step-level PRM annotations:

  • Step 1: [0.95] (very confident this is correct)
  • Step 2: [0.92] (confident this is correct)
  • Step 3: [0.91] (confident this is correct)

Final verification:

  • Execution result: answer = 27
  • Expected result: 27
  • Label: CORRECT ✓

What Happens After Multiple Rollouts?

In a real MCTS search, you don’t stop after one rollout. You continue:

Rollout 2: Same problem, but maybe the search explores Branch A or C this time (because B has been visited once, exploration bonus sends MCTS to less-explored branches).

Rollout 3-N: Keep exploring different paths. Eventually, MCTS converges:

  • Branch B consistently leads to correct answers
  • Branches A and C also work but are less efficient
  • The search tree settles on B as the best approach

After ~100 rollouts, MCTS has thoroughly explored the solution space and identified the highest-quality reasoning path(s).


Rounds 2–4: How the Model Improves

After Round 1:

  1. We collect ~8,000 solutions like the one above from MCTS across all 12,500 problems
  2. All solutions are CORRECT (verified by Python execution)
  3. All have high PRM scores (good reasoning quality)
  4. Train Qwen-7B on these 8,000 solutions → 68% accuracy

Now, in Round 2:

  1. Run MCTS again, but the model is better (68% vs. 42%)
  2. It generates better candidate steps (because it’s learned from Round 1 data)
  3. MCTS finds even higher-quality solutions
  4. These solutions train the model further → 78% accuracy

The cycle repeats.


Why This Example Works

For this specific problem:

  1. Python verification is perfect: We can run the code and check the answer with 100% certainty
  2. The solution is step-wise compositional: Each step builds on the previous one, and PRM can score each step
  3. The model can learn the pattern: By Round 2, the model has seen similar divisibility problems and learned the inclusion-exclusion approach

Not all problems have these properties. Open-ended problems (write an essay, design a system) don’t have automatic verification, so the approach breaks down. That’s a key limitation (Section 7).


Verification by Hand

Let’s verify the answer manually to confirm:

  • Multiples of 3 below 100: 3, 6, 9, …, 99. Count = ⌊99/3⌋ = 33. ✓
  • Multiples of 15 below 100: 15, 30, 45, 60, 75, 90. Count = ⌊99/15⌋ = 6. ✓
  • Divisible by 3 but NOT by 5: 33 - 6 = 27. ✓

By hand: 3, 6, 9, 12, 18, 21, 24, 27, 33, 36, 39, 42, 48, 51, 54, 57, 63, 66, 69, 72, 78, 81, 84, 87, 93, 96, 99 = 27 ✓