Section 08

Impact: The Reasoning Model Era Accelerates

rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking 2025

The Moment: January 2025

rStar-Math was released in January 2025, the same month as DeepSeek R1. Together, these papers signalled something profound: frontier reasoning is no longer the exclusive domain of the largest labs.

DeepSeek R1: Parallel Validation

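DeepSeek (a Chinese AI company) independently reached a related result by a different route:

  • Open-weight reasoning model
  • Trained with large-scale reinforcement learning on automatically checkable rewards (a different recipe from rStar-Math’s MCTS-based self-evolution)
  • Reported performance comparable to OpenAI o1 on reasoning benchmarks
  • Released under the permissive MIT licence

Two independent teams arrived at the same underlying idea: a model can improve its own reasoning when its outputs are checked automatically, without human-annotated reasoning steps. The methods differ, but the convergence suggests a principle rather than a fluke.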


What This Proves

Before 2025:

  • Frontier AI required billions in compute
  • Only OpenAI, Google, Meta, Microsoft could compete
  • Reproducibility was impossible (proprietary methods)

After rStar-Math and R1:

  • Frontier reasoning is reproducible
  • Mid-size labs can compete
  • Academic researchers can build state-of-the-art systems
  • The frontier is more accessible

Immediate Ecosystem Effects

Within weeks of rStar-Math’s release:

Open-source models jumped: Teams building on Llama, Qwen, and other open-source bases adopted reasoning techniques, narrowing the gap to proprietary models.

Benchmark records fell: The MATH benchmark, considered hard just months earlier, became routine to solve at 85%+. The frontier moved to harder problems (AIME, IMO).

Research directions opened: Universities and small labs began exploring:

  • Self-play for code generation
  • MCTS for multi-agent planning
  • Self-evolution for scientific problem solving

Democratisation of Reasoning

The most profound impact: democratisation.

Before: Build frontier reasoning model → requires roughly $100M+ → only big labs

After: Build frontier reasoning model → requires roughly $1M + smart engineering → any lab with resources

For India, for example:

  • IIT research groups can now train competitive reasoning models
  • Startups can build on rStar-Math principles
  • Students can reproduce the work with access to cloud GPUs

This shifts the field from “closed proprietary models” to “open reproducible research.”


The Broader Narrative: Scaling Diversity

The field used to believe: Scaling = more parameters.

Papers like Test-Time Compute (Paper 23) and rStar-Math (Paper 24) showed that scaling can also mean:

  • More inference-time compute (test-time search)
  • Better training data (self-evolved, verified)
  • Smarter algorithm design (MCTS instead of random sampling)

This broadens how labs can improve models. You don’t need infinite parameters; you need smart computation.
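
To make the inference-time option concrete, here is a minimal Python sketch of two ways to use extra samples: majority voting and verifier-guided best-of-N selection. The names `majority_vote`, `best_of_n`, and `toy_verifier` are illustrative assumptions, not code from either paper, and the exact-check verifier is only a stand-in for a learned reward model.

```python
from collections import Counter
from typing import Callable, Sequence


def majority_vote(answers: Sequence[str]) -> str:
    """Return the most frequent answer among the samples."""
    return Counter(answers).most_common(1)[0][0]


def best_of_n(answers: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the answer that the (hypothetical) verifier scores highest."""
    return max(answers, key=score)


def toy_verifier(answer: str) -> float:
    # Assumption: a stand-in for a reward model. Here the answer to
    # "What is 17 * 24?" can simply be checked exactly.
    return 1.0 if int(answer) == 17 * 24 else 0.0


# Five sampled answers; the correct one (408) is in the minority.
samples = ["398", "408", "398", "418", "398"]

voted = majority_vote(samples)               # picks the most common answer
selected = best_of_n(samples, toy_verifier)  # picks the verified answer
print(voted, selected)
```

In this toy case majority voting returns the common wrong answer, while the verifier recovers the minority correct one. That is the intuition behind pairing search with a scoring model instead of relying on sampling alone.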


Industry Applications

OpenAI o1: Already deployed as a product. Users appreciate extended thinking for hard problems.

Google Gemini variants: Include a “thinking” mode, explicitly acknowledging the paradigm.

Anthropic Claude: Exploring similar reasoning capabilities.

Open-source models: Llama, Mistral, Qwen variants all adding reasoning-aware training.

Most major AI companies now offer a “reasoning model” variant. rStar-Math was part of the wave that accelerated this adoption.


Research Directions Unlocked

Self-Play for Other Domains

  • Code: RL environment for code generation (HumanEval, MBPP)
  • Science: Reasoning in physics, chemistry, biology (with automatic verification via simulation)
  • Planning: Multi-step decision making (games, robotics)

Hybrid Approaches

  • Combining MCTS with other search algorithms (beam search, evolutionary algorithms)
  • Mixing self-evolution with human feedback (RLHF) for final polish
  • Transfer learning: train reasoning model on math, fine-tune on code

Verifier Research

  • Better PRMs (reward models that reliably score intermediate steps)
  • Weak verifiers for domains without perfect verification
  • Ensemble verification (combining several weak verifiers)
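
The idea of combining weak verifiers can be sketched in a few lines of Python. Everything here is a hypothetical illustration: the three checks, the `ensemble_accepts` helper, and the 0.75 acceptance fraction are assumptions chosen for a toy task (checking a claimed square root of 144 without computing it directly).

```python
from typing import Callable, Sequence

Verifier = Callable[[str], bool]
TARGET = 144  # toy task: is the candidate a plausible square root of 144?


def parses_as_int(s: str) -> bool:
    try:
        int(s)
        return True
    except ValueError:
        return False


def last_digit_matches(s: str) -> bool:
    # Weak check: the square must end in the same digit as the target.
    return parses_as_int(s) and (int(s) ** 2) % 10 == TARGET % 10


def plausible_magnitude(s: str) -> bool:
    # Weak check: 10^2 = 100 and 20^2 = 400 bracket the target.
    return parses_as_int(s) and 10 <= int(s) <= 20


def ensemble_accepts(candidate: str, verifiers: Sequence[Verifier],
                     min_fraction: float = 0.75) -> bool:
    """Accept when at least `min_fraction` of the verifiers agree."""
    if not verifiers:
        raise ValueError("need at least one verifier")
    votes = sum(1 for v in verifiers if v(candidate))
    return votes / len(verifiers) >= min_fraction


CHECKS = [parses_as_int, last_digit_matches, plausible_magnitude]
for cand in ["12", "13", "18", "abc"]:
    print(cand, ensemble_accepts(cand, CHECKS))
```

Note that "18" passes all three checks even though it is wrong. An ensemble of weak verifiers reduces false positives but does not eliminate them, which is why this remains an open research direction.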

Competitive Landscape Shift

Before:

  • OpenAI: o1 and variants (frontier reasoning)
  • Google: Gemini (large, multimodal)
  • Others: Catch-up with scaling

After:

  • DeepSeek: Competitive frontier reasoning (open-source)
  • Meta/Llama: Reasoning-enhanced variants (open-source)
  • Microsoft: Integrating via GitHub Copilot
  • Open labs: Can now build frontier systems

The moat shifted from “model size” to “algorithm sophistication” and “data quality.”


Technical Implications for Future Work

Test-Time Compute is Mainstream

Inference-time search (MCTS, beam search, best-of-N sampling) is no longer a research curiosity. It’s a standard tool in the reasoning model toolkit.
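
As a rough illustration of step-level search, the following Python sketch runs beam search over a toy problem: reach a target number from a start value using the steps "+3" and "*2". This is not rStar-Math's MCTS. The distance-to-target heuristic is an assumed stand-in for a process reward model, and all names are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Assumption: each "reasoning step" is one of two toy operations.
OPS: Dict[str, Callable[[int], int]] = {
    "+3": lambda x: x + 3,
    "*2": lambda x: x * 2,
}


def beam_search(start: int, target: int, beam_width: int = 2,
                max_steps: int = 5) -> Optional[List[str]]:
    """Keep the `beam_width` most promising partial paths at each step."""
    beam: List[Tuple[int, List[str]]] = [(start, [])]
    for _ in range(max_steps):
        # Expand every partial path by every available step.
        candidates = [(op(value), steps + [name])
                      for value, steps in beam
                      for name, op in OPS.items()]
        for value, steps in candidates:
            if value == target:
                return steps
        # Score partial paths: closer to the target is better.
        candidates.sort(key=lambda c: abs(target - c[0]))
        beam = candidates[:beam_width]
    return None


print(beam_search(1, 14, beam_width=2))  # wider beam
print(beam_search(1, 14, beam_width=1))  # greedy search
```

With a beam of two the search finds a three-step path, while the greedy variant (beam of one) commits to a locally better step and runs out of budget. Spending more compute on search can find solutions that a single pass misses.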

Training Data Quality > Quantity

The old wisdom: more data is better. rStar-Math shows: high-quality data (self-generated, verified) beats raw quantity.

Bootstrapping is Powerful

Lifting Qwen2.5-Math-7B from 58.8% to 90.0% on MATH, and Phi3-mini-3.8B from 41.4% to 86.4%, through self-evolution validates the bootstrapping paradigm. Future work will explore this in other domains.

Domain-Specific Verification is Key

The paper’s success hinges on Python verification for math. Future work: what’s the “Python” for other domains? (automated test suites for code, simulation for physics, etc.)
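
For code, one candidate answer to that question is an automated test suite. The Python sketch below shows the idea under stated assumptions: `passes_tests` is a hypothetical helper, the candidate programs are toy strings, and a real system would run untrusted code in a sandbox rather than calling `exec` directly.

```python
from typing import Any, Sequence, Tuple

Case = Tuple[Tuple[Any, ...], Any]  # (arguments, expected result)


def passes_tests(source: str, func_name: str, cases: Sequence[Case]) -> bool:
    """Execute candidate source and check it against every test case.

    Warning: exec() on untrusted code is unsafe; this is illustration only.
    """
    namespace: dict = {}
    try:
        exec(source, namespace)
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in cases)
    except Exception:
        # Any crash, missing function, or syntax error counts as a failure.
        return False


CASES: Sequence[Case] = [((1, 2), 3), ((0, 0), 0), ((-4, 4), 0)]

candidates = {
    "correct": "def add(a, b):\n    return a + b\n",
    "buggy": "def add(a, b):\n    return a - b\n",
    "broken": "def add(a, b) return a + b\n",  # syntax error
}

# Keep only candidates that the verifier accepts.
verified = [name for name, src in candidates.items()
            if passes_tests(src, "add", CASES)]
print(verified)
```

Only the verified candidates would be kept as training data, which mirrors the role that execution-based checking plays for math.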


Closing: The Boundary Moved

In September 2024 (o1), frontier reasoning seemed monopolised by one lab.

By January 2025 (rStar-Math, R1), it was reproducible and open-source.

This pattern — frontier → reproducible → commoditised — will repeat. By 2026, basic reasoning will be table stakes. The frontier will be harder (IMO-level problems, novel scientific reasoning, etc.).

The lesson for you: The frontier moves fast, but it moves in a direction. Understanding the principles (MCTS, self-evolution, verification) matters more than chasing the latest benchmark. Learn these, and you can build on the next frontier too.