rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking
What This Paper Did
In January 2025, Microsoft Research released a stunning result: a 7-billion-parameter open-source model — Qwen-7B, the size of a single high-end GPU’s working memory — matched OpenAI’s o1-preview on competition-level mathematics (MATH benchmark, 90% accuracy). No proprietary training data. No human annotation of solutions. Just four rounds of self-evolution.
The trick: use Monte Carlo Tree Search (MCTS) guided by a Process Reward Model (PRM) to generate high-quality step-by-step solutions in Python code. Verify them automatically (the code runs; does it give the right answer?). Train the model on this self-generated data. Repeat. With each round, both the model and the PRM improve, generating even better training data.
The narrative arc:
- Round 0: Base Qwen-7B, ~42% accuracy on MATH
- Round 1: MCTS + PRM₁ generates solutions → train model → 68% accuracy
- Round 2: Improved model generates better solutions → train → 78% accuracy
- Round 3: Better model generates better data → train → 85% accuracy
- Round 4: Best model generates best data → train → 90% accuracy (o1-preview tier)
This paper vindicates Paper 23’s (Test-Time Compute) core claim: intelligent inference-time computation can match or exceed raw model scale. But it goes further: self-generated training data, verified automatically, can bootstrap a small model to frontier performance.
The Indian Analogy
Imagine a brilliant IIT aspirant preparing for JEE without a tutor or coaching classes:
- Round 1: Sits for a practice exam (JEE mock). Solves each problem (generating solutions), then checks against a basic answer key (PRM₁). Identifies which approaches worked. Studies those solutions.
- Round 2: After studying, attempts a new mock exam. Now the student is sharper, can generate better solutions, and can critique themselves better too (PRM₂ is better because trained on Round 1 data). Studies the new solutions.
- Rounds 3 & 4: Repeats. Each cycle, the student learns from self-generated experience, not external teaching.
By Round 4, the student is competitive with the best IIT toppers — all through self-study, no tutor. That’s rStar-Math.
MCTS analogy: In chess, you don’t just play one game and move on. Strong players use analysis engines that explore thousands of move variations before committing to one. The engine uses a “coach” (evaluation function) to guide which variations to explore. rStar-Math applies this to math: explore many solution branches (MCTS), guided by a coach (PRM), then pick the best path.
Performance Comparison
| Model | Method | MATH Accuracy | Parameters | Training Data |
|---|---|---|---|---|
| Qwen-7B (base) | 1 attempt | 42.0% | 7B | Public (pre-training only) |
| Qwen-7B (rStar-Math) | 4-round self-evolution | ~90% | 7B | Self-generated + verified |
| GPT-4o | 1 attempt | 87.3% | ~100B | Proprietary |
| OpenAI o1-preview | Extended thinking | 90.6% | Proprietary | Proprietary |
The key insight: Same model size (7B), same hardware, radically different capability through self-evolved training data.
The Self-Evolution Loop (ASCII Diagram)
┌─────────────────────────────────────────────────────────────────────┐
│ ROUND 1: Base Model │
│ Qwen-7B-base (42% MATH accuracy) │
│ ↓ │
│ Run MCTS on all MATH training problems │
│ - Generate multiple solution paths for each problem │
│ - PRM₁ scores each step (promising or dead-end?) │
│ - Python execution verifies final answers (auto, no humans) │
│ ↓ │
│ Collect all CORRECT, high-PRM-score solutions │
│ ↓ │
│ Supervised Fine-Tuning: Train model on these solutions │
│ Also train PRM₂ on step-level annotations from MCTS │
│ ↓ │
│ Result: 68% accuracy on MATH │
└─────────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────────┐
│ ROUND 2: Improved Model │
│ Qwen-7B-improved (68% MATH accuracy) │
│ ↓ │
│ Run MCTS with the improved model & PRM₂ │
│ (Better model generates better candidate solutions) │
│ (Better PRM scores them more accurately) │
│ ↓ │
│ Collect CORRECT solutions from the improved MCTS │
│ ↓ │
│ Train on this new, higher-quality data → 78% accuracy │
│ Train PRM₃ │
└─────────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────────┐
│ ROUND 3 & 4: Cascading Improvement │
│ Each round: better model → better MCTS data → better training │
│ 85% → 90% (approaching o1-preview) │
│ │
│ The loop terminates when improvement plateaus (diminishing │
│ returns) or compute budget is exhausted. │
└─────────────────────────────────────────────────────────────────────┘
Before You Read
Prerequisite papers:
- Paper 16: Let’s Verify Step by Step — Understand Process Reward Models (PRM)
- Paper 23: Test-Time Compute — Understand why inference-time search works
Math/CS concepts:
- Monte Carlo Tree Search (MCTS) — Core algorithm for guided exploration
- Upper Confidence Bound (UCB) — How MCTS balances exploration and exploitation
- Supervised Fine-Tuning (SFT) — How to train on generated data
- Cross-Entropy Loss — Standard training objective
Benchmark:
- MATH dataset: 12,500 competition-level problems (AMC, AIME difficulty)
Reading Guide
| Section | Topic | Difficulty | Time | What You’ll Learn |
|---|---|---|---|---|
| 1 | Context & Setup | Beginner | 4 min | Why this problem matters, what came before |
| 2 | The Problem | Beginner | 3 min | Small models can’t solve hard math; big models are expensive |
| 3 | The Idea | Intermediate | 6 min | MCTS + PRM + Python verification + 4 rounds = frontier performance |
| 4 | The Math | Advanced | 10 min | UCB selection formula, PRM scoring, round-by-round improvement |
| 5 | Worked Example | Advanced | 8 min | Walk through one complete MCTS rollout on a real math problem |
| 6 | The Code | Intermediate | 7 min | Implement MCTS UCB, Program-of-Thought, auto-verification |
| 7 | Limitations | Intermediate | 4 min | What rStar-Math doesn’t do well (and why) |
| 8 | Impact | Beginner | 4 min | DeepSeek R1, o1 variants, democratisation of reasoning models |
| 9 | Summary | Beginner | 2 min | The big picture, what comes next |
Key Equations to Know
UCB (Upper Confidence Bound) for MCTS node selection: $$\text{UCB}(\text{node}) = \frac{Q(\text{node})}{N(\text{node})} + C \sqrt{\frac{\ln(N(\text{parent}))}{N(\text{node})}}$$
Balances exploitation (go to high-reward nodes) and exploration (visit under-explored nodes).
Best-of-N success probability: $$P(\text{at least one correct in } N \text{ samples}) = 1 - (1-p)^N$$
Shows why MCTS finding high-quality solutions matters: better p → higher success rate.
Round-by-round improvement (approximate):
- Round 0: 42% → Round 1: 68% (+26%) → Round 2: 78% (+10%) → Round 3: 85% (+7%) → Round 4: 90% (+5%)
Diminishing returns, as expected.
What’s New Here (vs. Paper 23)
Paper 23 showed: “Compute-optimal inference-time scaling beats raw model scale.”
Paper 24 adds: “Self-generated training data, verified automatically, enables 4 rounds of bootstrapping that amplifies the effect.”
The key innovation: close the loop. Use inference-time search (MCTS) to generate training data for supervised fine-tuning. The better model enables better search, generating better data, which trains an even better model. Four rounds later, frontier performance with no human annotation.
Navigation
← Paper 23: Test-Time Compute (inference-time search and Best-of-N)
Paper 24: rStar-Math (you are here) — Self-evolved training data meets test-time compute
Closing Note
You have now traced AI reasoning from first principles (Chain-of-Thought, Paper 14) → learned verification (Let’s Verify, Paper 16) → inference-time scaling (Test-Time Compute, Paper 23) → self-evolution (rStar-Math, Paper 24).
This is the frontier of 2025: reasoning models that learn from their own search, without human annotation, matching or exceeding human experts on rigorous domains.
The field moves quickly. By the time you read this, there may be Round 5, Round 6. But the principles stay the same. Keep learning.
Discussion
Questions about this paper? Spotted something unclear? Start a discussion below — powered by GitHub, no separate account needed.