Let’s walk through one complete MCTS search for a specific math problem. This will show exactly how MCTS + PRM + Python verification work together.
The Problem
“How many positive integers less than 100 are divisible by 3 but not by 5?”
Expected answer: 27
(Explanation: Divisible by 3 below 100 = 33 numbers. Divisible by both 3 and 5 (i.e., by 15) = 6 numbers. Inclusion-exclusion: 33 - 6 = 27.)
Round 1 of MCTS Search
The MCTS tree is initially empty (just the root).
Step 1: Initialization (Root Node)
Node state: “Problem: count integers 1-99 divisible by 3 but not by 5”
The root has no children yet. MCTS will generate candidate next steps.
Step 2: Selection (First Expansion)
The model (Qwen-7B) generates three candidate first steps:
Branch A: Mathematical reasoning in natural language
- “Multiples of 3 below 100: 3, 6, 9, …, 99. This is 99/3 = 33 numbers.”
- State: A₁ (after this step, we’ve counted divisible-by-3)
Branch B: Direct Python calculation
- “count_div3 = len([x for x in range(1, 100) if x % 3 == 0])”
- State: B₁ (after this step, count_div3 is computed)
Branch C: Formula-based approach
- “Use the formula for arithmetic sequences: a_n = a_1 + (n-1)d. Here a_1=3, d=3, a_n=99. Solve 99 = 3 + (n-1)×3 → n = 33.”
- State: C₁ (after this step, we know the count is 33)
Step 3: PRM Evaluation of Candidates
PRM₁ (initially trained) scores each candidate:
| Branch | Step Content | PRM Score | Reason |
|---|---|---|---|
| A | Natural language reasoning | 0.75 | Correct logic, but harder to verify |
| B | Python code | 0.95 | Unambiguous, auto-verifiable |
| C | Algebra | 0.82 | Correct, but more complex |
MCTS selects Branch B (highest UCB, because it has few visits and high PRM score).
Step 4: Expand Branch B
Current solution trace:
count_div3 = len([x for x in range(1, 100) if x % 3 == 0])
The model generates the next logical step:
Candidate next steps:
- “Now count multiples of 15 (divisible by both 3 and 5): count_div15 = len([x for x in range(1, 100) if x % 15 == 0])”
- “Subtract immediately: answer = count_div3 - 6” (incorrect — we haven’t counted the 6 yet)
- “Verify: run the code so far and print count_div3”
PRM scores:
- Candidate 1: 0.92 (correct, logical next step)
- Candidate 2: 0.40 (wrong — jumps to answer without intermediate step)
- Candidate 3: 0.88 (correct but maybe redundant)
MCTS expands Candidate 1 (B→B1).
New solution state (B₁):
count_div3 = len([x for x in range(1, 100) if x % 3 == 0])
count_div15 = len([x for x in range(1, 100) if x % 15 == 0])
Step 5: Continue from B₁
Model generates the next step:
Candidate next steps:
- “answer = count_div3 - count_div15” (correct final step)
- “Check: are there any numbers divisible by 5 but not 3? We don’t need to subtract those.” (confused reasoning)
PRM scores:
- Candidate 1: 0.91 (correct)
- Candidate 2: 0.35 (confused logic)
MCTS expands Candidate 1 (B₁→B₂).
Final solution state (B₂):
count_div3 = len([x for x in range(1, 100) if x % 3 == 0])
count_div15 = len([x for x in range(1, 100) if x % 15 == 0])
answer = count_div3 - count_div15
Step 6: Verification (Leaf Node Evaluation)
Now we have a complete solution. Run the code:
count_div3 = len([x for x in range(1, 100) if x % 3 == 0])
count_div15 = len([x for x in range(1, 100) if x % 15 == 0])
answer = count_div3 - count_div15
Execution:
count_div3 = len([3, 6, 9, ..., 99]) = 33count_div15 = len([15, 30, 45, 60, 75, 90]) = 6answer = 33 - 6 = 27
Expected answer: 27. ✓ CORRECT!
Reward: +1 (correct solution with high-quality reasoning)
Step 7: Backup (Propagate Reward Up the Tree)
Update all nodes on the path with the reward:
Root (42% base accuracy)
└─ B₁ "count_div3..." [PRM: 0.95]
└─ B₁ → B₂ "count_div15..." [PRM: 0.92]
└─ B₂ → answer... [PRM: 0.91]
✓ Rollout success: reward = 1
- B₂ → answer: visits = 1, Q = 1
- B₁ → B₂: visits = 1, Q = 1
- Root → B: visits = 1, Q = 1
What the Training Data Looks Like
This MCTS rollout produces one training example:
Input (problem description): “How many positive integers less than 100 are divisible by 3 but not by 5?”
Output (solution trace, step-by-step):
# Step 1: Count multiples of 3
count_div3 = len([x for x in range(1, 100) if x % 3 == 0])
# Step 2: Count multiples of 15 (both 3 and 5)
count_div15 = len([x for x in range(1, 100) if x % 15 == 0])
# Step 3: Use inclusion-exclusion
answer = count_div3 - count_div15
Step-level PRM annotations:
- Step 1: [0.95] (very confident this is correct)
- Step 2: [0.92] (confident this is correct)
- Step 3: [0.91] (confident this is correct)
Final verification:
- Execution result: answer = 27
- Expected result: 27
- Label: CORRECT ✓
What Happens After Multiple Rollouts?
In a real MCTS search, you don’t stop after one rollout. You continue:
Rollout 2: Same problem, but maybe the search explores Branch A or C this time (because B has been visited once, exploration bonus sends MCTS to less-explored branches).
Rollout 3-N: Keep exploring different paths. Eventually, MCTS converges:
- Branch B consistently leads to correct answers
- Branches A and C also work but are less efficient
- The search tree settles on B as the best approach
After ~100 rollouts, MCTS has thoroughly explored the solution space and identified the highest-quality reasoning path(s).
Rounds 2–4: How the Model Improves
After Round 1:
- We collect ~8,000 solutions like the one above from MCTS across all 12,500 problems
- All solutions are CORRECT (verified by Python execution)
- All have high PRM scores (good reasoning quality)
- Train Qwen-7B on these 8,000 solutions → 68% accuracy
Now, in Round 2:
- Run MCTS again, but the model is better (68% vs. 42%)
- It generates better candidate steps (because it’s learned from Round 1 data)
- MCTS finds even higher-quality solutions
- These solutions train the model further → 78% accuracy
The cycle repeats.
Why This Example Works
For this specific problem:
- Python verification is perfect: We can run the code and check the answer with 100% certainty
- The solution is step-wise compositional: Each step builds on the previous one, and PRM can score each step
- The model can learn the pattern: By Round 2, the model has seen similar divisibility problems and learned the inclusion-exclusion approach
Not all problems have these properties. Open-ended problems (write an essay, design a system) don’t have automatic verification, so the approach breaks down. That’s a key limitation (Section 7).
Verification by Hand
Let’s verify the answer manually to confirm:
- Multiples of 3 below 100: 3, 6, 9, …, 99. Count = ⌊99/3⌋ = 33. ✓
- Multiples of 15 below 100: 15, 30, 45, 60, 75, 90. Count = ⌊99/15⌋ = 6. ✓
- Divisible by 3 but NOT by 5: 33 - 6 = 27. ✓
By hand: 3, 6, 9, 12, 18, 21, 24, 27, 33, 36, 39, 42, 48, 51, 54, 57, 63, 66, 69, 72, 78, 81, 84, 87, 93, 96, 99 = 27 ✓