The Problem: Why Small Models Fail at Hard Math — rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking

The Core Challenge

Qwen-7B, trained only on public pre-training data, achieves ~42% accuracy on the MATH benchmark. In other words, out of every 10 competition-level math problems, it gets roughly 4 right and 6 wrong.

Why can’t we just improve this with more training data?

The Data Problem

The MATH dataset contains 12,500 problems with human-written solutions. You might think: “Just fine-tune on all of them!” But this doesn’t work well because:

Solution quality varies: Many solutions are correct in the final answer but use non-standard techniques or unclear reasoning. The model learns inconsistent patterns.
Dataset is small: 12,500 examples is not much for deep learning. Modern LLMs are pre-trained on trillions of tokens. 12,500 problems, each with a paragraph of reasoning, come to maybe 1–2 million tokens, a rounding error in modern training.
Noise in annotations: Some solutions in the dataset are outright wrong or contain errors. Training on these propagates mistakes.
Coverage is uneven: The dataset emphasizes certain problem types (algebra) over others (geometry). The model doesn’t get balanced exposure.
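
A back-of-the-envelope check of the token count above. The tokens-per-solution range and the 1-trillion-token corpus size are illustrative assumptions, not measured values:

```python
# Rough size of the MATH dataset in tokens, versus a pre-training corpus.
# ASSUMPTIONS (illustrative, not measured): 100-160 tokens per written
# solution, and a hypothetical pre-training corpus of 1 trillion tokens.
NUM_PROBLEMS = 12_500
TOKENS_PER_SOLUTION = (100, 160)
PRETRAIN_TOKENS = 1_000_000_000_000


def dataset_tokens(num_problems: int, tokens_per_solution: int) -> int:
    """Total tokens if every solution has the given length."""
    return num_problems * tokens_per_solution


low = dataset_tokens(NUM_PROBLEMS, TOKENS_PER_SOLUTION[0])
high = dataset_tokens(NUM_PROBLEMS, TOKENS_PER_SOLUTION[1])
print(f"MATH solutions: {low:,} to {high:,} tokens")
print(f"Share of assumed corpus: {high / PRETRAIN_TOKENS:.6%}")
```

Under these assumptions the whole dataset comes to 1.25 to 2 million tokens, a few millionths of the assumed corpus.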

Why Just “More Compute” Doesn’t Work

Test-Time Compute (Paper 23) suggests: generate N solutions, pick the best.

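But here’s the issue: 42% is an average over all problems. The model solves the easy ones almost every time and the hard ones almost never, so drawing N=100 samples for a hard problem mostly yields 100 variations of the same flawed reasoning.

Example:

42% base accuracy means a 58% failure rate per attempt, on average
If every attempt were an independent 42% coin flip, the chance that at least one of 100 is correct would be 1 - (0.58)^100 ≈ 100%… but this is misleading!

In reality:

Attempts on the same problem are not independent coin flips: on a hard problem the per-attempt success rate is far below 42%, so nearly all of the 100 attempts are wrong
“At least one sample is correct” is not the same as “we return the correct one”: the PRM (Process Reward Model) must pick out the few correct solutions from the many wrong ones
Even a small PRM error rate hurts: at a 5% false-positive rate, about 5 of 99 wrong solutions would be scored as good, and any of them could be selected over the correct one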

For hard problems where the base model’s reasoning is fragile, sampling more of the same thing doesn’t help much.

You need to generate better solutions in the first place.
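
To make the best-of-N arithmetic concrete, here is a minimal sketch. It computes the chance that at least one of N independent samples is correct, and a rough estimate of how often a noisy verifier's approved candidates are truly correct. The 1% success rate for a "hard" problem and the symmetric verifier error model are illustrative assumptions, not figures from the paper:

```python
def pass_at_n(p: float, n: int) -> float:
    """Chance that at least one of n independent attempts is correct."""
    return 1.0 - (1.0 - p) ** n


def rough_selection_precision(n_correct: int, n_wrong: int, error_rate: float) -> float:
    """Share of verifier-approved candidates that are truly correct.

    Simplified model (an assumption): the verifier wrongly approves a wrong
    solution, or wrongly rejects a correct one, with the same error_rate.
    Uses expected counts, so the result is only a rough estimate.
    """
    approved_correct = n_correct * (1.0 - error_rate)
    approved_wrong = n_wrong * error_rate
    return approved_correct / (approved_correct + approved_wrong)


# 0.42 is the average accuracy from the text; 0.01 is an assumed
# per-attempt success rate for a single hard problem.
for p in (0.42, 0.01):
    print(f"p={p:.2f}: P(at least one correct in 100) = {pass_at_n(p, 100):.3f}")

# One correct solution hidden among 99 wrong ones, 5% verifier error rate.
print(f"rough precision = {rough_selection_precision(1, 99, 0.05):.3f}")
```

Under these assumptions, a hard problem with a 1% per-attempt success rate has only about a 63% chance of producing any correct sample in 100 tries, and a verifier with a 5% error rate approves roughly five wrong solutions for each correct one.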

Why Bigger Models Don’t Solve It (For Everyone)

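OpenAI’s o1 shows that the large-model route works. Its size is undisclosed; ~100B parameters is an outside estimate. But that route comes at a price:

Cost: roughly $10 million in compute plus months of research (a rough estimate)
Accessible to: OpenAI and a handful of other labs globally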

For everyone else, this path is closed.

The Real Root Cause: Lack of High-Quality Training Data

The fundamental problem is data, not model size:

A 7B model trained on truly excellent step-by-step reasoning traces could be far stronger
But generating such data requires human experts (expensive) or automated verification (hard)

Math has a distinct advantage here: reasoning steps can be written as Python code and executed, and final answers can be checked against known ones automatically.

What We Need

An algorithm that can:

Generate candidate solutions for hard math problems (even if many are wrong)
Verify solutions automatically (execute Python code, check if answer is correct)
Evaluate reasoning quality at each step (which approach is promising? Which is a dead-end?)
Collect high-quality data — only the correct, high-quality solutions
Train the model on this data
Improve the evaluation function so the next iteration generates even better solutions
Repeat — bootstrap the model through multiple rounds

This is exactly what rStar-Math does.
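
The shape of that loop can be sketched with a toy stand-in. This is not the paper's implementation: there is no search and no reward model here, the "policy" is a single hypothetical `skill` probability, and "training" is a made-up update rule. It only illustrates the generate, verify, filter, train, repeat cycle:

```python
import random


def generate(skill: float, truth: int, rng: random.Random) -> int:
    """Toy policy (hypothetical): right answer with probability `skill`."""
    return truth if rng.random() < skill else truth + rng.randint(1, 9)


def verify(answer: int, truth: int) -> bool:
    """Automatic check against the known answer."""
    return answer == truth


def self_evolve(truths, skill=0.2, rounds=4, samples=8, seed=0):
    """Run the toy loop and return the skill after each round."""
    rng = random.Random(seed)
    history = [skill]
    for _ in range(rounds):
        kept = []  # verified (problem, answer) pairs for this round
        for truth in truths:
            candidates = [generate(skill, truth, rng) for _ in range(samples)]
            verified = [a for a in candidates if verify(a, truth)]
            if verified:
                kept.append((truth, verified[0]))
        coverage = len(kept) / len(truths)
        # Toy "training" (an assumption): move skill halfway toward the
        # fraction of problems that yielded verified data, never downward.
        skill = max(skill, skill + 0.5 * (coverage - skill))
        history.append(skill)
    return history


history = self_evolve(list(range(50)))
print([round(s, 3) for s in history])
```

Even in this toy, sampling several candidates per problem lets a weak policy collect verified data on problems it usually gets wrong, which is what gives the next round something better to train on.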

Why This Is Hard (And Why It Matters)

Each of these steps individually is known:

Generating multiple solutions: just sample from the model (easy)
Verifying with Python: straightforward (easy)
Evaluating reasoning: use a Process Reward Model (known from Paper 16)
Training: standard supervised fine-tuning (easy)

But combining them into a self-improving loop is non-trivial:

The PRM’s quality determines the quality of selected solutions
The quality of solutions determines how well the next model trains
A weak PRM early on might select bad solutions and poison the next training round

The paper’s main contribution is showing that this bootstrapping actually works — starting from a weak PRM and weak model, you can improve both together across rounds, eventually reaching frontier performance.

The Vision

Instead of:

“Spend $10M or more to train a ~100B-parameter model”

We want:

“Run smart search on a 7B model, train it on self-generated data, and iterate”
Cost: ~$1M in GPU compute, weeks instead of months
Accessible: any research group with modest resources

The paper demonstrates that this vision is achievable.