Section 09

Summary: From 42% to 90% Through Self-Evolution

rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking (2025)

The One-Sentence Version

A 7-billion-parameter open-source model, trained entirely on self-generated data (no human labels), matches OpenAI’s o1-preview on competition math after 4 rounds of self-evolution, each combining MCTS-guided search with automatic Python verification.

The Problem We Started With

Small models can’t solve hard math. Big models can, but they cost millions to train. OpenAI kept the formula secret. The question: Can any lab reproduce frontier reasoning without proprietary methods?

The Key Ideas

  1. Monte Carlo Tree Search (MCTS): Don’t sample solutions randomly. Use a Process Reward Model to guide search toward promising reasoning paths.

  2. Program-of-Thought (PoT): Write solutions in Python, not natural language. Run the code to verify automatically — no humans needed.

  3. Self-Evolution: Use MCTS to generate training data. Train the model on this data. The improved model generates better data for the next round.

  4. Four Rounds: Starting from 42% accuracy (Qwen-7B base), iterate: MCTS → verify → train → improve model → repeat.

  5. Virtuous Cycle: Each round, both the model and the verifier (PRM) improve. Better model → better solutions → better training data → even better model.
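The verify-then-train loop in ideas 2 to 4 can be sketched in a few lines of Python. This is a toy illustration under loud assumptions, not the paper's implementation: `run_candidate`, `keep_verified`, `self_evolve`, `toy_generate`, and `toy_train` are hypothetical names, the "model" is just an integer counter, and checking each candidate against a known final answer is an assumed verification rule.

```python
from typing import Callable, List, Optional, Tuple


def run_candidate(code: str) -> Optional[object]:
    """Execute a candidate solution; return its `answer` variable, or None if it crashes."""
    scope: dict = {}
    try:
        exec(code, {}, scope)  # toy only: generated code needs a real sandbox
    except Exception:
        return None
    return scope.get("answer")


def keep_verified(candidates: List[str], expected: object) -> List[str]:
    """Keep only candidates that run and reproduce the expected final answer (assumed rule)."""
    return [c for c in candidates if run_candidate(c) == expected]


def self_evolve(model, problems: List[Tuple[str, object]], rounds: int,
                generate: Callable, train: Callable):
    """Each round: generate candidates, filter by execution, train on the survivors."""
    for _ in range(rounds):
        data = []
        for question, expected in problems:
            for code in keep_verified(generate(model, question), expected):
                data.append((question, code))
        model = train(model, data)
    return model


# Hypothetical stand-ins: the "model" is an integer that counts verified examples seen.
def toy_generate(model: int, question: str) -> List[str]:
    return [f"answer = {question}", "answer = 1 / 0"]  # one valid, one crashing candidate


def toy_train(model: int, data: list) -> int:
    return model + len(data)


problems = [("2 + 3", 5), ("6 * 7", 42)]
final_model = self_evolve(0, problems, rounds=4, generate=toy_generate, train=toy_train)
print(final_model)
```

The point of the sketch is the data flow: only candidates that survive execution ever reach the training step, so the next round starts from a model trained on verified data only.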

The Numbers That Matter

  • Starting accuracy: 42% (base Qwen-7B on MATH benchmark)
  • After Round 1: 68% (+26 percentage points)
  • After Round 2: 78% (+10 pp)
  • After Round 3: 85% (+7 pp)
  • After Round 4: 90% (+5 pp, matches o1-preview)
  • Model size: 7 billion parameters (no scale-up needed)
  • Training data source: Self-generated, auto-verified (no humans)
  • Compute cost: ~$1M in GPUs (vs. $100M+ for training o1)
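The per-round gains shrink in absolute terms, but a quick calculation on the accuracy figures listed above shows that each round still removes a sizeable share of the remaining errors. The variable names below are illustrative only.

```python
# Accuracy in percent, taken from the list above: base model, then rounds 1 to 4.
accuracy = [42, 68, 78, 85, 90]

# Absolute gain per round, in percentage points.
gains = [after - before for before, after in zip(accuracy, accuracy[1:])]

# Error rate before each round, and the share of that error the round removed.
error_before = [100 - a for a in accuracy[:-1]]
relative_error_cut = [100 * g / e for g, e in zip(gains, error_before)]

for rnd, (g, cut) in enumerate(zip(gains, relative_error_cut), start=1):
    print(f"Round {rnd}: +{g} pp, removes {cut:.0f}% of remaining errors")
```

On these figures, Round 1 removes roughly 45% of the remaining errors, and Rounds 2 to 4 each remove roughly a third.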

The Indian Analogy

Imagine a brilliant student preparing for JEE without a tutor:

  • Round 1: Attempts a mock exam from a question bank. Reviews answers against a basic key. Identifies which approaches worked. Studies them.
  • Round 2: Attempts another mock. Now sharper, can tackle harder problems. Learns new techniques. Studies deeper.
  • Rounds 3–4: Repeats. Each cycle, the student improves through self-generated practice and self-critique.

By Round 4, the student is competitive with the best IIT toppers. The starting intelligence is the same; the difference comes from smart self-improvement.

MCTS aspect: Don’t just solve each problem once. Think through multiple solution approaches (MCTS tree), guided by intuition (PRM) about which look promising. Pick the best.
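The "guided by intuition" part can be made concrete with the standard UCT selection rule used in many MCTS implementations. This is a minimal sketch, assuming the reward passed to `backpropagate` would come from a PRM or from code verification; the `Node` class, the exploration constant `c = 1.4`, and the example step names are all hypothetical, and the paper's exact formula may differ.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    step: str                      # one reasoning step
    visits: int = 0
    total_value: float = 0.0       # sum of rewards backed up through this node
    children: List["Node"] = field(default_factory=list)

    @property
    def q(self) -> float:
        """Average reward seen so far (0.0 for an unvisited node)."""
        return self.total_value / self.visits if self.visits else 0.0


def uct(parent: Node, child: Node, c: float = 1.4) -> float:
    """Exploitation (average reward) plus an exploration bonus for rarely tried steps."""
    if child.visits == 0:
        return math.inf            # always try an unvisited step first
    return child.q + c * math.sqrt(math.log(parent.visits) / child.visits)


def select_child(parent: Node) -> Node:
    return max(parent.children, key=lambda ch: uct(parent, ch))


def backpropagate(path: List[Node], reward: float) -> None:
    """Update statistics along the path from the root to the evaluated node."""
    for node in path:
        node.visits += 1
        node.total_value += reward


root = Node("problem", visits=10, total_value=5.4)
root.children = [
    Node("factorise", visits=6, total_value=4.2),
    Node("complete the square", visits=4, total_value=1.2),
    Node("substitute", visits=0),
]
chosen = select_child(root)        # the unvisited step is explored first
backpropagate([root, chosen], reward=0.5)
```

The exploration bonus is what separates this from greedy search: a step with a mediocre average can still be revisited if it has been tried only a few times.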

What This Paper Adds (vs. Paper 23)

Paper 23 (Test-Time Compute): Showed that smart inference-time compute (Best-of-N + PRM) can match big models.

Paper 24 (rStar-Math): Showed that self-generated training data (via MCTS) amplifies this effect. Close the loop: use inference-time search to improve the base model itself.

The power of combining both: a flywheel where each round improves both the model and the verifier, cascading to frontier performance.
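For comparison, the Best-of-N + PRM idea from Paper 23 can be sketched as follows. This is an illustration under assumptions: `best_of_n`, `solution_score`, and `toy_prm` are hypothetical names, the PRM is a lookup table rather than a trained model, and taking the minimum step score is only one common way to aggregate step scores.

```python
from typing import Callable, List, Sequence


def solution_score(step_scores: Sequence[float]) -> float:
    """Aggregate per-step scores; a solution is only as strong as its weakest step (assumed rule)."""
    return min(step_scores)


def best_of_n(candidates: List[List[str]], prm: Callable[[str], float]) -> List[str]:
    """Score every sampled solution step by step and return the highest-scoring one."""
    return max(candidates, key=lambda steps: solution_score([prm(s) for s in steps]))


# Hypothetical PRM: a fixed table of step scores in [0, 1].
toy_scores = {"expand (x+1)^2": 0.9, "x^2 + 2x + 1": 0.95, "x^2 + 1": 0.2}


def toy_prm(step: str) -> float:
    return toy_scores[step]


candidates = [
    ["expand (x+1)^2", "x^2 + 1"],        # contains a low-scoring step
    ["expand (x+1)^2", "x^2 + 2x + 1"],   # every step scores well
]
best = best_of_n(candidates, toy_prm)
```

Best-of-N only selects among finished solutions at inference time. The rStar-Math loop goes one step further by feeding the selected, verified solutions back into training.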

Key Assumptions (When This Works)

rStar-Math works because:

  1. Math has perfect verification (Python code execution)
  2. Solutions decompose into steps (PRM can score each step)
  3. Correctness is binary (right or wrong, no ambiguity)
  4. Base model is capable enough (7B is OK; 1B would struggle)

For domains without these properties (open-ended writing, subjective evaluation, ambiguous correctness), the approach breaks down.

What Came Next

DeepSeek R1 (January 2025): An independent team released an open-source reasoning model, showing that this line of work is not limited to Microsoft.

Industry adoption: OpenAI o1, Google Gemini thinking, and Anthropic reasoning variants. Their training details are not public, but all appear to rely on related principles.

Research directions: Self-play for code, science problems, planning. The ideas are spreading.

The Bigger Shift

Before 2025: Frontier AI = big labs only.

After 2025: Frontier reasoning = reproducible, accessible, open-source.

This is a democratisation. A research group with a GPU budget far below frontier-lab scale can now build competitive reasoning models. The frontier moved from “model size” to “algorithm sophistication.”

Key Limitations

  • Needs automatic verification (works for math; not for open-ended tasks)
  • Requires step-wise solution structure (breaks down for problems that do not decompose into checkable steps)
  • Can’t add knowledge (only improves reasoning, not recall)
  • PRM bootstrapping is non-trivial (initial PRM must be decent)
  • Computational cost is high (not accessible to everyone)
  • MCTS is narrow (strong on math; unclear on other domains)

rStar-Math is not a universal solution. It’s a proof that self-evolution works in domains with certain properties.

Closing

You have now traced the reasoning revolution from first principles:

  1. Paper 14 (Chain-of-Thought): Models reason better when they write out steps.
  2. Paper 16 (Let’s Verify): Score intermediate steps with PRMs.
  3. Paper 23 (Test-Time Compute): Smart inference-time search beats raw model scale.
  4. Paper 24 (rStar-Math): Self-evolved training data, verified automatically, bootstraps small models to frontier performance.

This is the foundation of the reasoning-model era that defines 2025.

The Final Question

Turing asked in 1950: “Can machines think?”

In 2025, the question isn’t “can they” but “how efficiently?” And the answer is increasingly: by learning to allocate their compute intelligently.

You have reached the final paper in the ainiketan.in series on AI reasoning. You have traced the journey from Turing’s question (1950) to self-evolving reasoning models (2025) — seven decades of progress compressed into 24 papers.

The field will keep moving. New benchmarks will emerge. New methods will be discovered. But the principles you’ve learned — reasoning through steps, verifying quality, allocating compute wisely, bootstrapping through self-generated data — will persist.

Keep reading. Keep building. The frontier needs you.
