Context: The Question Behind This Work — rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking

Where We Are in January 2025

OpenAI’s o1 (September 2024) showed that extended thinking, letting a model “think for longer” before it answers, substantially improves mathematical reasoning. In the rStar-Math paper’s comparison, o1-preview scores 85.5% on MATH, a competition-level math benchmark, against 76.6% for GPT-4o.

But o1 has problems:

Closed-source: The weights and training recipe are not public
Expensive: Built on proprietary training data and a large in-house research effort
Inaccessible: Usable only through a paid API, so researchers cannot inspect or fine-tune it
Opaque methods: OpenAI has published few details of the techniques involved

The open-source community asks: Can we do this ourselves? Can a small, freely available model match o1’s reasoning ability through self-training, without proprietary data?

The Challenge

The MATH benchmark contains 12,500 competition-level problems (AMC and AIME difficulty). These are hard — they require:

Multiple steps of reasoning
Symbolic manipulation (algebra, geometry, number theory)
Creative problem decomposition (no template fits them all)
Self-correction (spotting and fixing errors mid-solution)

Small open models start far below o1-class scores: before self-evolution, the paper reports 58.8% on MATH for Qwen2.5-Math-7B and 41.4% for Phi3-mini-3.8B.

To improve, you have two paths:

Train a bigger model on more data (the conventional frontier-lab route). Cost: likely millions of dollars and months of training.
Use a small model more intelligently (as this paper does). Cost: weeks of GPU compute, automated data generation.

What Came Before: The Foundation

Chain-of-Thought (Paper 14, 2022): Models reason better when they write out their reasoning step by step.

Let’s Verify Step by Step (Paper 16, 2023): You can train a Process Reward Model (PRM) to score individual reasoning steps. This is more informative than checking only the final answer, because it shows where a solution goes wrong.

Test-Time Compute (Paper 23, 2024): Instead of training bigger models, spend more compute at inference time. Generate multiple solutions, then use a PRM to pick the best. The headline finding: a small model with well-allocated test-time compute can rival a much larger model that answers in a single pass.
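
The "generate multiple solutions, use a PRM to pick the best" idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not any paper's implementation: `toy_prm` is a hypothetical stand-in for a trained reward model, and scoring a solution by its weakest step is one common aggregation choice among several.

```python
from typing import Callable, List, Sequence


def solution_score(step_scores: Sequence[float]) -> float:
    """Aggregate per-step scores; the minimum penalizes any single bad step."""
    return min(step_scores) if step_scores else 0.0


def best_of_n(candidates: List[List[str]],
              score_step: Callable[[str], float]) -> List[str]:
    """Return the candidate solution whose weakest step scores highest."""
    return max(
        candidates,
        key=lambda steps: solution_score([score_step(s) for s in steps]),
    )


def toy_prm(step: str) -> float:
    """Hypothetical PRM stand-in: steps that contain 'guess' score low."""
    return 0.1 if "guess" in step else 0.9


candidates = [
    ["set up equation", "guess x = 3"],
    ["set up equation", "solve for x"],
]
best = best_of_n(candidates, toy_prm)
print(best)
```

In a real system, `score_step` would call a trained model and would usually see the problem and the preceding steps, not just the current step in isolation.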

All three of these insights come together in rStar-Math:

Use CoT (write out step-by-step reasoning)
Use a PRM (score solution quality at each step; rStar-Math’s variant is a process preference model, or PPM, trained on step-level preferences)
Use MCTS + test-time compute (intelligent search to find high-quality solutions)
Close the loop: Train the model on the high-quality solutions you found
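
To make "intelligent search" concrete, here is a sketch of the selection rule at the heart of MCTS. It uses the textbook UCT formula (average value plus an exploration bonus); the `Node` class, the constant `c = 1.4`, and the example numbers are illustrative assumptions rather than the paper's settings.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    step: str                      # the reasoning step this node represents
    visits: int = 0                # how many rollouts passed through this node
    total_value: float = 0.0       # sum of rewards backed up through this node
    children: List["Node"] = field(default_factory=list)


def uct(child: Node, parent_visits: int, c: float = 1.4) -> float:
    """Textbook UCT: average value plus a bonus for rarely visited children."""
    if child.visits == 0:
        return math.inf            # always try unvisited steps first
    exploit = child.total_value / child.visits
    explore = c * math.sqrt(math.log(parent_visits) / child.visits)
    return exploit + explore


def select_child(parent: Node) -> Node:
    """Pick the child step with the highest UCT score."""
    return max(parent.children, key=lambda ch: uct(ch, parent.visits))


root = Node("problem", visits=10, children=[
    Node("step A", visits=6, total_value=4.0),
    Node("step B", visits=4, total_value=1.0),
    Node("step C"),                # never visited, so it is selected next
])
print(select_child(root).step)
```

The exploration term is what separates search from greedy decoding: a step that looks mediocre after a few rollouts still gets revisited occasionally, which gives the search a chance to recover from early misjudgments.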

The Key Insight: Python as an Automatic Verifier

Here’s a crucial observation: many math reasoning steps can be expressed as Python code.

Instead of reasoning only in prose (“Let me count the multiples of 3…”), the model also writes each step as Python:

count = len([x for x in range(1, 100) if x % 3 == 0])
answer = count

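To label the solution, run the code and compare its result with the problem’s known reference answer:

namespace = {}  # isolated scope for the generated code
try:
    exec(solution_code, namespace)  # untrusted code: sandbox this in practice
    answer = namespace.get("answer")
except Exception:
    answer = None  # code that fails to run produces no answer
if answer is not None and answer == expected_answer:
    label = "CORRECT"
else:
    label = "WRONG"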

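No human labellers are needed to grade solutions, and no subjective judgment is involved. Two automatic checks do the work: a step whose code fails to run is discarded, and the final answer is compared against the problem’s reference answer. This removes a major bottleneck in producing training data: human annotation. The check is not infallible, since a solution can reach the right answer through flawed reasoning, which is one reason step-level scoring is still needed.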

Compare:

Traditional approach: Generate solutions, hire mathematicians to grade them. Expensive, slow, limited scale.
rStar-Math approach: Generate solutions with Python, run them, check automatically. Cheap, fast, and limited mainly by compute.
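
The automatic check extends naturally from a single snippet to a multi-step solution. The sketch below is a simplified illustration, not the paper's pipeline: the function names and the three labels are assumptions made for this example, and `exec` on model-generated code would need a sandbox in any real setting.

```python
from typing import Any, Dict, List, Optional


def run_steps(step_codes: List[str]) -> Optional[Dict[str, Any]]:
    """Execute step snippets in one shared namespace; None if any step fails."""
    namespace: Dict[str, Any] = {}
    for code in step_codes:
        try:
            exec(code, namespace)  # unsafe for untrusted code without a sandbox
        except Exception:
            return None            # a step that does not run invalidates the path
    return namespace


def label_trajectory(step_codes: List[str], expected_answer: Any) -> str:
    """Label a solution: INVALID if code fails, else compare the final answer."""
    namespace = run_steps(step_codes)
    if namespace is None:
        return "INVALID"
    if namespace.get("answer") == expected_answer:
        return "CORRECT"
    return "WRONG"


good_steps = [
    "multiples = [x for x in range(1, 100) if x % 3 == 0]",
    "answer = len(multiples)",
]
print(label_trajectory(good_steps, 33))
print(label_trajectory(["answer = 1 / 0"], 0))
```

Separating INVALID from WRONG keeps two failure modes apart: code that cannot run at all, and code that runs but reaches a different answer than the reference.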

Why This Matters

rStar-Math demonstrates two things:

Small models can reach frontier-level scores on math benchmarks, not through training-time scale but through better training data generation and search.
Self-evolution works: a model that generates its own training data, checked automatically, can bootstrap itself. Each round of improvement unlocks better data generation for the next round.

For students and researchers without billions of dollars:

You don’t need to train GPT-4 from scratch
You don’t need to match OpenAI’s proprietary methods
You can build a strong reasoning model using public datasets, MCTS, and automated verification

The paper is a statement: frontier AI is accessible.

The Setup

Starting point: Qwen2.5-Math-7B, a public 7-billion-parameter math-specialized model from Alibaba (the paper also evolves smaller models such as Phi3-mini-3.8B). It is far smaller than GPT-3 (175B); o1’s size has not been disclosed.

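Training data: 747k math problems with reference final answers, a much larger pool than the 12,500-problem MATH dataset alone. No human-written step-by-step solutions are used as supervision; the reasoning trajectories are generated by search.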

The 4-round loop:

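Run MCTS on the problems: the policy model proposes steps, the reward model guides the search, and Python execution plus a reference-answer check filters the trajectories
Train the policy model and the reward model on the verified trajectories
Repeat with the improved models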
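
The shape of this loop can be shown with a deliberately tiny simulation. Everything here is a toy assumption: the "policy" is just the set of questions it has learned, `toy_search` stands in for MCTS, and `toy_train` stands in for fine-tuning. Only the control flow (search, verify against the reference answer, train, repeat) mirrors the loop described above.

```python
from typing import Callable, List, Set, Tuple

Problem = Tuple[str, int, int]   # (question, difficulty, reference answer)
Candidate = Tuple[str, int]      # (solution text, final answer)
Policy = Set[str]                # toy "model": the questions it has learned


def toy_search(policy: Policy, problem: Problem) -> List[Candidate]:
    """Stand-in for MCTS: succeeds only on problems just beyond current skill."""
    question, difficulty, reference = problem
    if difficulty <= len(policy) + 1:
        return [(f"worked solution for {question}", reference)]
    return [("unsupported guess", reference + 1)]  # deliberately wrong answer


def toy_train(policy: Policy, verified: List[Tuple[str, str]]) -> Policy:
    """Stand-in for fine-tuning: remember every question with a verified solution."""
    return policy | {question for question, _ in verified}


def self_evolve(
    problems: List[Problem],
    policy: Policy,
    search: Callable[[Policy, Problem], List[Candidate]],
    train: Callable[[Policy, List[Tuple[str, str]]], Policy],
    rounds: int = 4,
) -> Tuple[Policy, List[int]]:
    """Search, keep answer-verified solutions, train on them, and repeat."""
    history: List[int] = []
    for _ in range(rounds):
        verified: List[Tuple[str, str]] = []
        for problem in problems:
            question, _, reference = problem
            for solution, answer in search(policy, problem):
                if answer == reference:      # automatic check against the reference
                    verified.append((question, solution))
        policy = train(policy, verified)
        history.append(len(verified))
    return policy, history


problems = [("p1", 1, 2), ("p2", 2, 4), ("p3", 3, 6), ("p4", 4, 8)]
final_policy, history = self_evolve(problems, set(), toy_search, toy_train)
print(history)
```

In this toy, `history` comes out as `[1, 2, 3, 4]`: each round's training lets the next round's search verify one harder problem, which is the bootstrapping effect in miniature. A real run swaps both stand-ins for model inference and gradient updates.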

The result: After four rounds, 90.0% accuracy on MATH, up from the base model’s 58.8%. In the paper’s comparison, that is ahead of o1-preview (85.5%) and on par with o1-mini, using a 7B model.

What We’ll Learn

In this paper, you’ll learn:

MCTS for language models: How to structure search over reasoning steps
Process Reward Models: How to train a “coach” that guides the search
Program-of-Thought: Why Python verification is more powerful than natural language evaluation
Self-evolution mechanics: Why bootstrapping works; where it plateaus
Math reasoning at scale: What competition-level reasoning actually requires

By the end, you’ll understand not just what rStar-Math did, but why it worked — and how to apply these ideas beyond just math.