rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking

What This Paper Did

In January 2025, Microsoft Research released a stunning result: a 7-billion-parameter open-source model — Qwen-7B, the size of a single high-end GPU’s working memory — matched OpenAI’s o1-preview on competition-level mathematics (MATH benchmark, 90% accuracy). No proprietary training data. No human annotation of solutions. Just four rounds of self-evolution.

The trick: use Monte Carlo Tree Search (MCTS) guided by a Process Reward Model (PRM) to generate high-quality step-by-step solutions in Python code. Verify them automatically (the code runs; does it give the right answer?). Train the model on this self-generated data. Repeat. With each round, both the model and the PRM improve, generating even better training data.

The narrative arc:

Round 0: Base Qwen-7B, ~42% accuracy on MATH
Round 1: MCTS + PRM₁ generates solutions → train model → 68% accuracy
Round 2: Improved model generates better solutions → train → 78% accuracy
Round 3: Better model generates better data → train → 85% accuracy
Round 4: Best model generates best data → train → 90% accuracy (o1-preview tier)

This paper vindicates Paper 23’s (Test-Time Compute) core claim: intelligent inference-time computation can match or exceed raw model scale. But it goes further: self-generated training data, verified automatically, can bootstrap a small model to frontier performance.

The Indian Analogy

Imagine a brilliant IIT aspirant preparing for JEE without a tutor or coaching classes:

Round 1: Sits for a practice exam (JEE mock). Solves each problem (generating solutions), then checks against a basic answer key (PRM₁). Identifies which approaches worked. Studies those solutions.
Round 2: After studying, attempts a new mock exam. Now the student is sharper, can generate better solutions, and can critique themselves better too (PRM₂ is better because trained on Round 1 data). Studies the new solutions.
Rounds 3 & 4: Repeats. Each cycle, the student learns from self-generated experience, not external teaching.

By Round 4, the student is competitive with the best IIT toppers — all through self-study, no tutor. That’s rStar-Math.

MCTS analogy: In chess, you don’t just play one game and move on. Strong players use analysis engines that explore thousands of move variations before committing to one. The engine uses a “coach” (evaluation function) to guide which variations to explore. rStar-Math applies this to math: explore many solution branches (MCTS), guided by a coach (PRM), then pick the best path.

Performance Comparison

Model	Method	MATH Accuracy	Parameters	Training Data
Qwen-7B (base)	1 attempt	42.0%	7B	Public (pre-training only)
Qwen-7B (rStar-Math)	4-round self-evolution	~90%	7B	Self-generated + verified
GPT-4o	1 attempt	87.3%	~100B	Proprietary
OpenAI o1-preview	Extended thinking	90.6%	Proprietary	Proprietary

The key insight: Same model size (7B), same hardware, radically different capability through self-evolved training data.

The Self-Evolution Loop (ASCII Diagram)

┌─────────────────────────────────────────────────────────────────────┐
│                    ROUND 1: Base Model                              │
│  Qwen-7B-base (42% MATH accuracy)                                   │
│              ↓                                                       │
│  Run MCTS on all MATH training problems                             │
│  - Generate multiple solution paths for each problem                │
│  - PRM₁ scores each step (promising or dead-end?)                   │
│  - Python execution verifies final answers (auto, no humans)        │
│              ↓                                                       │
│  Collect all CORRECT, high-PRM-score solutions                      │
│              ↓                                                       │
│  Supervised Fine-Tuning: Train model on these solutions             │
│  Also train PRM₂ on step-level annotations from MCTS                │
│              ↓                                                       │
│  Result: 68% accuracy on MATH                                       │
└─────────────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────────────┐
│                    ROUND 2: Improved Model                          │
│  Qwen-7B-improved (68% MATH accuracy)                               │
│              ↓                                                       │
│  Run MCTS with the improved model & PRM₂                            │
│  (Better model generates better candidate solutions)                │
│  (Better PRM scores them more accurately)                           │
│              ↓                                                       │
│  Collect CORRECT solutions from the improved MCTS                   │
│              ↓                                                       │
│  Train on this new, higher-quality data → 78% accuracy              │
│  Train PRM₃                                                         │
└─────────────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────────────┐
│                    ROUND 3 & 4: Cascading Improvement               │
│  Each round: better model → better MCTS data → better training      │
│  85% → 90% (approaching o1-preview)                                 │
│                                                                     │
│  The loop terminates when improvement plateaus (diminishing          │
│  returns) or compute budget is exhausted.                           │
└─────────────────────────────────────────────────────────────────────┘

Before You Read

Prerequisite papers:

Paper 16: Let’s Verify Step by Step — Understand Process Reward Models (PRM)
Paper 23: Test-Time Compute — Understand why inference-time search works

Math/CS concepts:

Monte Carlo Tree Search (MCTS) — Core algorithm for guided exploration
Upper Confidence Bound (UCB) — How MCTS balances exploration and exploitation
Supervised Fine-Tuning (SFT) — How to train on generated data
Cross-Entropy Loss — Standard training objective

Benchmark:

MATH dataset: 12,500 competition-level problems (AMC, AIME difficulty)

Reading Guide

Section	Topic	Difficulty	Time	What You’ll Learn
1	Context & Setup	Beginner	4 min	Why this problem matters, what came before
2	The Problem	Beginner	3 min	Small models can’t solve hard math; big models are expensive
3	The Idea	Intermediate	6 min	MCTS + PRM + Python verification + 4 rounds = frontier performance
4	The Math	Advanced	10 min	UCB selection formula, PRM scoring, round-by-round improvement
5	Worked Example	Advanced	8 min	Walk through one complete MCTS rollout on a real math problem
6	The Code	Intermediate	7 min	Implement MCTS UCB, Program-of-Thought, auto-verification
7	Limitations	Intermediate	4 min	What rStar-Math doesn’t do well (and why)
8	Impact	Beginner	4 min	DeepSeek R1, o1 variants, democratisation of reasoning models
9	Summary	Beginner	2 min	The big picture, what comes next

Key Equations to Know

UCB (Upper Confidence Bound) for MCTS node selection: $$\text{UCB}(\text{node}) = \frac{Q(\text{node})}{N(\text{node})} + C \sqrt{\frac{\ln(N(\text{parent}))}{N(\text{node})}}$$

Balances exploitation (go to high-reward nodes) and exploration (visit under-explored nodes).

Best-of-N success probability: $$P(\text{at least one correct in } N \text{ samples}) = 1 - (1-p)^N$$

Shows why MCTS finding high-quality solutions matters: better p → higher success rate.

Round-by-round improvement (approximate):

Round 0: 42% → Round 1: 68% (+26%) → Round 2: 78% (+10%) → Round 3: 85% (+7%) → Round 4: 90% (+5%)

Diminishing returns, as expected.

What’s New Here (vs. Paper 23)

Paper 23 showed: “Compute-optimal inference-time scaling beats raw model scale.”

Paper 24 adds: “Self-generated training data, verified automatically, enables 4 rounds of bootstrapping that amplifies the effect.”

The key innovation: close the loop. Use inference-time search (MCTS) to generate training data for supervised fine-tuning. The better model enables better search, generating better data, which trains an even better model. Four rounds later, frontier performance with no human annotation.

← Paper 23: Test-Time Compute (inference-time search and Best-of-N)

Paper 24: rStar-Math (you are here) — Self-evolved training data meets test-time compute

Closing Note

You have now traced AI reasoning from first principles (Chain-of-Thought, Paper 14) → learned verification (Let’s Verify, Paper 16) → inference-time scaling (Test-Time Compute, Paper 23) → self-evolution (rStar-Math, Paper 24).

This is the frontier of 2025: reasoning models that learn from their own search, without human annotation, matching or exceeding human experts on rigorous domains.

The field moves quickly. By the time you read this, there may be Round 5, Round 6. But the principles stay the same. Keep learning.