The Core Challenge
Qwen-7B, trained only on public pre-training data, achieves ~42% accuracy on the MATH benchmark. This means that out of 10 hard competition math problems, it gets roughly 4 right and 6 wrong.
Why can’t we just improve this with more training data?
The Data Problem
The MATH dataset contains 12,500 problems with human-written solutions. You might think: “Just fine-tune on all of them!” But this doesn’t work well because:
- Solution quality varies: Many solutions reach the correct final answer but use non-standard techniques or unclear reasoning, so the model learns inconsistent patterns.
- Dataset is small: 12,500 examples is not much for deep learning. Frontier models are pre-trained on trillions of tokens (GPT-4’s exact figure is undisclosed). 12,500 problems, each with a paragraph of reasoning, is maybe 1–2 million tokens, a rounding error in modern training.
- Noise in annotations: Some solutions in the dataset contain errors or are outright wrong. Training on these propagates mistakes.
- Coverage is uneven: The dataset emphasizes certain problem types (algebra) over others (geometry). The model doesn’t get balanced exposure.
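The "1–2 million tokens" estimate is simple arithmetic and easy to check. The sketch below assumes a range of 80 to 160 tokens per written solution; that range is an illustrative guess, not a measured statistic of the dataset.

```python
# Rough token budget for fine-tuning on the MATH solutions alone.
NUM_PROBLEMS = 12_500  # dataset size quoted in the text

# Assumed tokens per human-written solution (illustrative, not measured).
TOKENS_PER_SOLUTION_LOW = 80
TOKENS_PER_SOLUTION_HIGH = 160


def dataset_tokens(num_problems: int, tokens_per_solution: int) -> int:
    """Total tokens if every solution has the given length."""
    return num_problems * tokens_per_solution


low = dataset_tokens(NUM_PROBLEMS, TOKENS_PER_SOLUTION_LOW)
high = dataset_tokens(NUM_PROBLEMS, TOKENS_PER_SOLUTION_HIGH)
print(f"{low / 1e6:.1f}M to {high / 1e6:.1f}M tokens")  # 1.0M to 2.0M tokens
```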
Why Just “More Compute” Doesn’t Work
Test-Time Compute (Paper 23) suggests: generate N solutions, pick the best.
But here’s the issue: the 42% figure is an average over the whole benchmark. On the hard problems that drag the average down, the per-attempt success rate is far lower, so even with N=100 attempts you are mostly generating wrong solutions and hoping one is right by luck.
Example:
- 42% base accuracy, read naively, means a 58% failure rate per attempt
- Best-of-100 success rate = 1 - (0.58)^100 ≈ 100%… but this is misleading! It assumes every problem is equally hard and every attempt is independent.
In reality:
- On a hard problem, nearly all of the 100 attempts can be wrong, with perhaps 1 correct
- The PRM (Process Reward Model) must identify that 1 correct solution among 99 wrong ones
- If the PRM has even a small error rate (say 5%), it will often rank a wrong solution first
For hard problems where the base model’s reasoning is fragile, sampling more of the same thing doesn’t help much.
You need to generate better solutions in the first place.
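The gap between "a correct sample exists" and "the selector actually picks it" can be made concrete with a few lines of arithmetic. This is a simplified sketch under stated assumptions: attempts are independent, a hard problem has a hypothetical 1% per-attempt success rate, and the PRM independently ranks each wrong solution above the correct one with 5% probability. None of these numbers come from the paper.

```python
def p_at_least_one_correct(p_success: float, n: int) -> float:
    """Probability that at least one of n independent attempts is correct."""
    return 1.0 - (1.0 - p_success) ** n


def p_selector_picks_correct(n_wrong: int, misrank_rate: float) -> float:
    """Probability the one correct solution outranks every wrong one,
    assuming each wrong solution is mis-ranked independently."""
    return (1.0 - misrank_rate) ** n_wrong


# Naive reading: every problem has the benchmark-average success rate.
naive = p_at_least_one_correct(0.42, 100)

# Hard problem (assumed 1% per attempt): a correct sample is far from guaranteed.
hard = p_at_least_one_correct(0.01, 100)

# Even when exactly one sample is correct, the PRM must beat 99 distractors.
pick = p_selector_picks_correct(99, 0.05)

print(f"naive best-of-100:        {naive:.4f}")  # ~1.0000
print(f"hard-problem best-of-100: {hard:.4f}")   # ~0.6340
print(f"PRM picks the right one:  {pick:.4f}")   # ~0.0062
```

Under these assumptions, a 5% per-comparison error rate leaves the selector with well under a 1% chance of surfacing the lone correct solution, which is why raising the base success rate matters more than raising N.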
Why Bigger Models Don’t Solve It (For Everyone)
OpenAI trained o1, which is estimated at ~100B parameters (OpenAI has not disclosed the size). This works, but at a price:
- Costs: ~$10 million in compute + months of research
- Accessible to: OpenAI and maybe 3 other labs globally
For everyone else, this path is closed.
The Real Root Cause: Lack of High-Quality Training Data
The fundamental problem is data, not model size:
- A 7B model trained on truly excellent step-by-step reasoning traces could be far stronger
- But generating such data requires human experts (expensive) or automated verification (hard)
For math, there is a unique advantage: a solution written as Python code can be executed, and its result checked automatically.
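A minimal sketch of what automatic verification can look like follows. It assumes a hypothetical convention in which each candidate solution is a Python snippet that stores its result in a variable named `answer`; the helper names are illustrative and are not the paper's pipeline. A real system would also run untrusted code in a sandbox with time limits.

```python
def run_candidate(code: str) -> object:
    """Execute candidate solution code and return its `answer` variable.

    Assumed convention (hypothetical): the snippet assigns its final
    result to a variable called `answer`.
    """
    namespace: dict = {}
    exec(code, namespace)  # NOTE: unsandboxed; acceptable only for a toy demo
    return namespace.get("answer")


def is_correct(code: str, expected: object) -> bool:
    """True if the code runs without error and produces the expected answer."""
    try:
        return run_candidate(code) == expected
    except Exception:
        return False  # code that crashes counts as a failed solution


# "How many ways are there to choose 2 items from 5?" Expected answer: 10.
good = "from math import comb\nanswer = comb(5, 2)"
wrong = "answer = 5 * 2 - 1"      # runs, but gives 9
broken = "answer = 5 * 2 / 0"     # raises ZeroDivisionError

print(is_correct(good, 10), is_correct(wrong, 10), is_correct(broken, 10))
```

The point of the design is that no human grader is involved: a candidate is accepted or rejected purely by running it and comparing against the known answer.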
What We Need
An algorithm that can:
- Generate candidate solutions for hard math problems (even if many are wrong)
- Verify solutions automatically (execute Python code, check if answer is correct)
- Evaluate reasoning quality at each step (which approach is promising? Which is a dead-end?)
- Collect high-quality data — only the correct, high-quality solutions
- Train the model on this data
- Improve the evaluation function so the next iteration generates even better solutions
- Repeat — bootstrap the model through multiple rounds
This is exactly what rStar-Math does.
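The steps listed above can be sketched as a generic loop. This is a structural outline only, not rStar-Math's implementation: every callable (`generate`, `verify`, `score`, `train_policy`, `train_scorer`) is a hypothetical placeholder, and the toy stand-ins at the bottom exist only so the sketch runs end to end.

```python
from typing import Callable, List, Tuple

Trace = Tuple[str, str]  # (problem, candidate solution)


def bootstrap(
    problems: List[str],
    generate: Callable[[str, int], List[str]],
    verify: Callable[[str, str], bool],
    score: Callable[[str, str], float],
    train_policy: Callable[[List[Trace]], None],
    train_scorer: Callable[[List[Trace], List[Trace]], None],
    rounds: int = 3,
    n_samples: int = 8,
    keep_top: int = 2,
) -> List[int]:
    """Run generate -> verify -> score -> collect -> train for several rounds.

    Returns the number of traces kept for policy training in each round.
    """
    history: List[int] = []
    for _ in range(rounds):
        good: List[Trace] = []
        bad: List[Trace] = []
        for problem in problems:
            candidates = generate(problem, n_samples)
            flags = [verify(problem, c) for c in candidates]
            verified = [c for c, ok in zip(candidates, flags) if ok]
            rejected = [c for c, ok in zip(candidates, flags) if not ok]
            # Keep only the best-scoring verified solutions.
            verified.sort(key=lambda c: score(problem, c), reverse=True)
            good += [(problem, c) for c in verified[:keep_top]]
            bad += [(problem, c) for c in rejected]
        train_policy(good)        # fine-tune the generator on kept traces
        train_scorer(good, bad)   # improve the evaluator for the next round
        history.append(len(good))
    return history


# Toy stand-ins (all hypothetical): problems are arithmetic strings.
def toy_generate(problem: str, n: int) -> List[str]:
    target = eval(problem)
    return [str(target + (i % 3) - 1) for i in range(n)]  # mix of right and wrong


def toy_verify(problem: str, candidate: str) -> bool:
    return int(candidate) == eval(problem)


training_log: List[Tuple[str, int]] = []
history = bootstrap(
    problems=["2+3", "7*6"],
    generate=toy_generate,
    verify=toy_verify,
    score=lambda problem, candidate: 1.0,  # constant score: no real ranking
    train_policy=lambda good: training_log.append(("policy", len(good))),
    train_scorer=lambda good, bad: training_log.append(("scorer", len(bad))),
)
print(history)
```

In this toy run the training callables only log what they receive, so nothing improves between rounds; in a real system the two training steps would change what `generate` and `score` return in the next round.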
Why This Is Hard (And Why It Matters)
Each of these steps individually is known:
- Generating multiple solutions: just sample from the model (easy)
- Verifying with Python: straightforward (easy)
- Evaluating reasoning: use a Process Reward Model (known from Paper 16)
- Training: standard supervised fine-tuning (easy)
But combining them into a self-improving loop is non-trivial:
- The PRM’s quality determines the quality of selected solutions
- The quality of solutions determines how well the next model trains
- A weak PRM early on might select bad solutions and poison the next training round
The paper’s main contribution is showing that this bootstrapping actually works — starting from a weak PRM and weak model, you can improve both together across rounds, eventually reaching frontier performance.
The Vision
Instead of:
- “Spend $100M to train a 100B model”
We want:
- “Run smart search on a 7B model, train it on self-generated data, and iterate”
- Cost: ~$1M in GPU compute, weeks instead of months
- Accessible: any research group with modest resources
This paper demonstrates that this vision is achievable.