Where We Are in January 2025
OpenAI o1 (September 2024) showed that extended thinking, letting a model “think for longer” before it answers, substantially improves mathematical reasoning. On MATH-500, a subset of the competition-level MATH benchmark, OpenAI reported 85.5% for o1-preview and 90.0% for o1-mini, against roughly 60% for GPT-4o.
But o1 has problems:
- Closed-source: No one knows exactly how it works
- Expensive: Proprietary training data and months of research from OpenAI’s best researchers
- Inaccessible: Not available to researchers without an API key
- Opaque methods: The techniques are largely secret
The open-source community asks: Can we do this ourselves? Can a small, freely available model match o1’s reasoning ability through self-training, without proprietary data?
The Challenge
The MATH benchmark contains 12,500 competition-level problems (AMC and AIME difficulty). These are hard — they require:
- Multiple steps of reasoning
- Symbolic manipulation (algebra, geometry, number theory)
- Creative problem decomposition (no template fits them all)
- Self-correction (spotting and fixing errors mid-solution)
The small open models used in the paper start well below that: it reports 58.8% for Qwen2.5-Math-7B and 41.4% for Phi3-mini-3.8B on MATH, respectable but far from o1’s 90%.
To improve, you have two paths:
- Train a bigger model (like OpenAI did for o1). Cost: millions of dollars, months of training.
- Use the small model smarter (as this paper does). Cost: weeks of GPU compute, automated data generation.
What Came Before: The Foundation
Chain-of-Thought (Paper 14, 2022): Models reason better when they write out their reasoning step by step.
Let’s Verify Step by Step (Paper 16, 2023): You can train a Process Reward Model (PRM) to score individual reasoning steps. This is much more useful than just checking if the final answer is right.
Test-Time Compute (Paper 23, 2024): Instead of training bigger models, spend more compute at inference time. Generate multiple solutions, use a PRM to pick the best. The result: 3.8B model + optimal test-time compute ≈ 70B model.
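The "generate multiple solutions, use a PRM to pick the best" idea can be sketched in a few lines. This is a minimal illustration, not the method of any of the papers above: `toy_prm` is a hypothetical stand-in for a trained Process Reward Model, and scoring a solution by its weakest step is an assumption (a product of step scores or the final step's score are other common choices).

```python
from typing import Callable, List, Sequence


def solution_score(step_scores: Sequence[float]) -> float:
    """Aggregate per-step scores: a solution is only as good as its weakest step."""
    return min(step_scores) if step_scores else 0.0


def best_of_n(
    candidates: List[List[str]],
    score_step: Callable[[List[str], str], float],
) -> List[str]:
    """Return the candidate solution whose weakest step scores highest."""
    def score(candidate: List[str]) -> float:
        # Each step is scored in the context of the steps that precede it.
        scores = [score_step(candidate[:i], step) for i, step in enumerate(candidate)]
        return solution_score(scores)

    return max(candidates, key=score)


def toy_prm(previous_steps: List[str], step: str) -> float:
    """Hypothetical PRM stand-in: distrust any step that admits to guessing."""
    return 0.1 if "guess" in step else 0.9


candidates = [
    ["x + 2 = 5", "guess x = 4"],
    ["x + 2 = 5", "x = 5 - 2", "x = 3"],
]
best = best_of_n(candidates, toy_prm)
print(best[-1])  # prints "x = 3"
```

Because the scorer sees every intermediate step, it can reject a solution for one bad step even when an answer-only check would have no way to tell the candidates apart.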
All three of these insights come together in rStar-Math:
- Use CoT (write out step-by-step reasoning)
- Use a PRM (score solution quality at each step)
- Use MCTS + test-time compute (intelligent search to find high-quality solutions)
- Close the loop: Train the model on the high-quality solutions you found
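To make "intelligent search" concrete, here is a minimal sketch of the selection and backup steps of a generic Monte Carlo Tree Search over reasoning steps. It uses the textbook UCT rule; the `Node` class, the exploration constant `c = 1.4`, and the example steps are illustrative assumptions, not details taken from the paper.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    step: str                   # the reasoning step this node adds
    visits: int = 0             # number of rollouts that passed through this node
    total_value: float = 0.0    # sum of rewards backed up through this node
    children: List["Node"] = field(default_factory=list)


def uct(parent: Node, child: Node, c: float = 1.4) -> float:
    """Average reward so far plus a bonus for rarely tried steps."""
    if child.visits == 0:
        return math.inf  # always try an unvisited step first
    exploit = child.total_value / child.visits
    explore = c * math.sqrt(math.log(max(parent.visits, 1)) / child.visits)
    return exploit + explore


def select_child(parent: Node) -> Node:
    """Pick the next step to expand."""
    return max(parent.children, key=lambda child: uct(parent, child))


def backpropagate(path: List[Node], reward: float) -> None:
    """After a rollout ends, credit every node on the path with its reward."""
    for node in path:
        node.visits += 1
        node.total_value += reward


root = Node("problem statement", visits=10)
factor = Node("factor the quadratic", visits=6, total_value=5.0)
plug_in = Node("try x = 1", visits=4, total_value=1.0)
root.children = [factor, plug_in]
chosen = select_child(root)
print(chosen.step)  # prints "factor the quadratic"
```

The exploration bonus shrinks as a step is visited more often, so the search keeps returning to promising steps while still occasionally testing neglected ones.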
The Key Insight: Python is a Perfect Verifier
Here’s a crucial observation: Math problems can be solved in Python code.
Instead of reasoning in prose (“Let me count the multiples of 3…”), the model outputs Python:
count = len([x for x in range(1, 100) if x % 3 == 0])
answer = count
Now, to verify if the solution is correct, you just run the code:
namespace = {}
exec(solution_code, namespace)  # run the generated solution in its own namespace
if namespace.get("answer") == expected_answer:
    label = "CORRECT"
else:
    label = "WRONG"
No human labellers needed. No subjective judgment. Just automatic execution. This eliminates the biggest bottleneck in training data: human annotation.
Compare:
- Traditional approach: Generate solutions, hire mathematicians to grade them. Expensive, slow, limited scale.
- rStar-Math approach: Generate solutions in Python, run them, check automatically. Free, fast, unlimited scale.
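The automatic check can be made a little more robust by running each generated solution in a separate interpreter with a timeout, so a crash or an infinite loop simply counts as a wrong answer. The sketch below is one way to do that under stated assumptions: the helper names `run_solution` and `label_solution` are hypothetical, the convention that a solution stores its result in a variable called `answer` follows the example above, and answers are compared as plain strings.

```python
import subprocess
import sys
from typing import Optional


def run_solution(code: str, timeout_s: float = 5.0) -> Optional[str]:
    """Run generated code in a fresh interpreter; return the printed answer or None."""
    program = code + "\nprint(answer)\n"
    try:
        result = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return None  # infinite loops count as failures
    if result.returncode != 0:
        return None  # so do exceptions and syntax errors
    lines = result.stdout.strip().splitlines()
    return lines[-1] if lines else None


def label_solution(code: str, expected_answer) -> str:
    """Label a generated solution by comparing its answer with the known one."""
    return "CORRECT" if run_solution(code) == str(expected_answer) else "WRONG"


good = "answer = len([x for x in range(1, 100) if x % 3 == 0])"
crashing = "answer = 1 / 0"
labels = [label_solution(code, 33) for code in (good, crashing)]
print(labels)
```

A separate process protects the labelling loop from crashes, but it is not a security sandbox, and real answer matching usually needs normalisation (for example, treating 0.5 and 1/2 as equal).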
Why This Matters
rStar-Math proves two things:
- Small models can scale to frontier performance, not through training-time scale but through better training data generation.
- Self-evolution works: a model generating its own training data, verified automatically, can bootstrap itself. Each round of improvement unlocks better data generation for the next round.
For students and researchers without billions of dollars:
- You don’t need to train GPT-4 from scratch
- You don’t need to match OpenAI’s proprietary methods
- You can build a strong reasoning model using public datasets, MCTS, and automated verification
The paper is a statement: frontier AI is accessible.
The Setup
Starting point: a public 7-billion-parameter Qwen model from Alibaba (the paper’s headline results use Qwen2.5-Math-7B). It’s far smaller than GPT-3 (175B). OpenAI has not disclosed o1’s size, but it is widely assumed to be much larger.
Training data: The MATH dataset (12,500 problems). No human solutions are used. The model generates solutions from scratch.
The 4-round loop:
- MCTS search on all problems, PRM guides search, Python verifies
- Train the model on the verified solutions
- Repeat with the improved model
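The loop above can be sketched as a small driver function. Everything model-related here is a toy stand-in chosen so the example runs: the "policy" is just an integer skill level, `toy_search` pretends that search lets the policy solve problems one level above its current skill, and `toy_train` raises the skill to the hardest problem it has verified solutions for. None of these stand-ins come from the paper.

```python
def self_evolve(problems, policy, search, train, rounds=4):
    """Alternate search and training; return the final policy and per-round yield."""
    solved_per_round = []
    for _ in range(rounds):
        verified = []
        for question, expected in problems:
            for solution, answer in search(policy, question):
                if answer == expected:  # automatic check, no human grader
                    verified.append((question, solution))
        solved_per_round.append(len(verified))
        policy = train(policy, verified)  # the next round searches with a better policy
    return policy, solved_per_round


def toy_search(skill, level):
    """Hypothetical search: reaches one level beyond the policy's current skill."""
    if level <= skill + 1:
        return [(f"solution for level {level}", level * 2)]
    return [(f"failed attempt at level {level}", -1)]


def toy_train(skill, verified):
    """Hypothetical training: skill rises to the hardest verified level."""
    return max([skill] + [level for level, _ in verified])


# Each toy problem is (difficulty level, expected answer).
problems = [(level, level * 2) for level in range(1, 5)]
final_skill, solved_per_round = self_evolve(problems, 0, toy_search, toy_train)
print(final_skill, solved_per_round)  # prints "4 [1, 2, 3, 4]"
```

The growing per-round count is the bootstrapping effect in miniature: problems that were out of reach in one round become solvable in the next because the policy was trained on the previous round's verified solutions.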
The result: After 4 rounds, the 7B model reaches 90% accuracy on MATH, in the same range as OpenAI’s o1 models. The model is no bigger than when it started; only the data it was trained on improved.
What We’ll Learn
In this paper, you’ll learn:
- MCTS for language models: How to structure search over reasoning steps
- Process Reward Models: How to train a “coach” that guides the search
- Program-of-Thought: Why Python verification is more powerful than natural language evaluation
- Self-evolution mechanics: Why bootstrapping works; where it plateaus
- Math reasoning at scale: What competition-level reasoning actually requires
By the end, you’ll understand not just what rStar-Math did, but why it worked — and how to apply these ideas beyond just math.