Section 09

Summary: The Power of Thinking Harder

Scaling LLM Test-Time Compute Optimally Can be More Effective than Scaling Model Parameters (2024)

The One-Sentence Version

For hard reasoning tasks, spending more compute at inference time (generating multiple solutions or iteratively refining one) can match or exceed the performance of a model roughly 18× larger (3.8B vs. 70B parameters in the comparison here).

The Problem We Started With

Larger models are expensive to train. Training a 70B-parameter model costs millions of dollars. A smaller 3.8B-parameter model costs a fraction of that — but it’s much weaker on reasoning tasks, especially competition-level math.

The paper asks: instead of training bigger models, what if we let smaller models think longer?

The Key Ideas

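  1. Best-of-N: Generate N independent solutions and use a Process Reward Model (PRM) to pick the best. If each attempt is correct with probability p, the probability that at least one of the N attempts is correct is 1 - (1-p)^N. Even with p = 0.1 (10% base accuracy), N = 29 attempts give about a 95% chance that a correct solution is among the samples. Final accuracy reaches that ceiling only if the verifier reliably picks the correct one.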

  2. Sequential revision: Iteratively refine a solution, getting feedback and improving each round. Better when the model can learn from its mistakes.

  3. Compute-optimal strategy: Choose between Best-of-N and sequential revision based on problem difficulty. Easier problems, where a first attempt is usually close, favour sequential revision. Harder problems favour parallel sampling (Best-of-N), which explores different approaches.

  4. The critical assumption: You need a reliable verifier, such as a Process Reward Model, to select the best solution. For math problems whose answers can be computed, executing Python code can serve as a check that is close to ground truth.
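
The arithmetic in idea 1 and the two strategies in ideas 1 and 2 can be sketched in a few lines of Python. This is a minimal illustration, not the paper's code: the function names (`coverage`, `min_samples_for`, `best_of_n`, `sequential_revision`) are hypothetical, attempts are assumed to be independent, and the `score` callable is a toy stand-in for a PRM.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def coverage(p: float, n: int) -> float:
    """Chance that at least one of n independent attempts is correct."""
    return 1.0 - (1.0 - p) ** n


def min_samples_for(p: float, target: float) -> int:
    """Smallest n whose coverage reaches target."""
    if not (0.0 < p <= 1.0) or not (0.0 <= target < 1.0):
        raise ValueError("need 0 < p <= 1 and 0 <= target < 1")
    n = 1
    while coverage(p, n) < target:
        n += 1
    return n


def best_of_n(candidates: Sequence[T], score: Callable[[T], float]) -> T:
    """Parallel strategy: keep the candidate the verifier scores highest."""
    return max(candidates, key=score)


def sequential_revision(initial: T, revise: Callable[[T], T],
                        score: Callable[[T], float], rounds: int) -> T:
    """Sequential strategy: revise repeatedly, keep the best-scoring draft."""
    best = current = initial
    for _ in range(rounds):
        current = revise(current)
        if score(current) > score(best):
            best = current
    return best


# Worked numbers from the text: p = 0.1 needs 29 attempts for 95% coverage.
print(min_samples_for(0.1, 0.95))   # 29
print(round(coverage(0.1, 29), 3))  # 0.953

# Toy verifiers (stand-ins for a PRM). The true answer is "42".
answers = ["41", "42", "40"]
reliable = lambda a: 1.0 if a == "42" else 0.0
unreliable = lambda a: -float(a)  # wrongly prefers smaller numbers
print(best_of_n(answers, reliable))    # 42
print(best_of_n(answers, unreliable))  # 40
```

The last two lines show why idea 4 matters: the correct answer is among the samples in both cases, but only the reliable scorer selects it.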

The Numbers That Matter

  • 3.8B model + compute-optimal test-time compute (TTC) ≈ 70B model on the MATH benchmark (competition-level math)
  • Base 3.8B alone: ~42% accuracy on MATH
  • With TTC (Best-of-N + PRM): ~90% accuracy on MATH
  • 70B model baseline: ~90% accuracy on MATH (without TTC)

The breakthrough: you can get frontier performance from a much smaller base model if you’re smart about how you allocate compute at inference time.
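
One way to picture "allocating compute" is a small planner that decides, per problem, how to spend a fixed budget of generations. The sketch below is an assumption-laden toy: `Plan`, `plan_budget`, the `estimated_pass_rate` input, and the 0.5 threshold are all hypothetical and are not taken from the paper. It only encodes the rule that hard problems get parallel attempts and easier ones get sequential revision.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    parallel_chains: int  # independent attempts started from scratch
    steps_per_chain: int  # generations per chain (1 = no revision)


def plan_budget(estimated_pass_rate: float, budget: int,
                easy_threshold: float = 0.5) -> Plan:
    """Spend `budget` generations on one strategy, chosen by difficulty.

    `easy_threshold` is an illustrative value, not one from the paper.
    """
    if budget < 1:
        raise ValueError("budget must be at least 1")
    if not 0.0 <= estimated_pass_rate <= 1.0:
        raise ValueError("estimated_pass_rate must be in [0, 1]")
    if estimated_pass_rate >= easy_threshold:
        # Easier problem: one chain, every generation refines the last draft.
        return Plan(parallel_chains=1, steps_per_chain=budget)
    # Harder problem: every generation is a fresh, independent attempt.
    return Plan(parallel_chains=budget, steps_per_chain=1)


print(plan_budget(0.8, 16))  # Plan(parallel_chains=1, steps_per_chain=16)
print(plan_budget(0.1, 16))  # Plan(parallel_chains=16, steps_per_chain=1)
```

A real system would more plausibly mix the two strategies and estimate difficulty from the model itself. The all-or-nothing split here is only the simplest version of the idea.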

The Indian Analogy

Imagine a student preparing for JEE (entrance exam for IIT):

  • Closed-book exam (base model): One shot, no revision. Either you know it or you don’t.
  • With thinking time (sequential revision): Read the question, work through it carefully, double-check your work. Better accuracy, but takes longer.
  • Best-of-N approach: Solve the same problem 5 different ways, check which method gives consistent answers, pick that one.
  • With a good verifier (PRM): After each method, immediately check: “Does this method make sense? Does the final answer feel right?” Use that feedback to pick the strongest approach.

By Round 4 of self-study (rStar-Math, Paper 24), the student is competitive with IIT toppers. Same starting point, but through smart iteration, they catch up.

What Came Next

  • OpenAI o1 (September 2024): The public realisation of test-time compute. o1 “thinks” internally for seconds or minutes on hard problems, then answers.
  • DeepSeek R1 (January 2025): An openly released reasoning model, showing that the approach works outside proprietary labs.
  • Google Gemini 2.0 Flash Thinking (2024): Google’s reasoning model, using similar principles.
  • rStar-Math (Paper 24): Takes the idea even further — small models achieve frontier performance through 4 rounds of self-evolution with MCTS-guided search.

The Bigger Picture

This paper marks a shift in how the field thinks about AI reasoning:

Before: Bigger model = better reasoning. Scale at training time.

After: Compute allocation matters as much as scale. Spend thoughtfully at inference time.

The question “Can machines think?” (Turing, 1950) is increasingly answered not with “they have bigger brains” but with “they learn to allocate their compute intelligently.”

Key Limitations to Remember

  • Needs a reliable verifier (works best in domains with checkable answers, like math)
  • Latency multiplies with more attempts (users wait longer)
  • Can’t compensate for missing knowledge (still need training on facts)
  • Cost scales linearly with the number of attempts (50 attempts ≈ 50× the generated tokens)
  • Unknown how well this generalises beyond mathematical reasoning
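
The cost and latency points above are simple multiplication, sketched here with made-up numbers. Everything in this block is an assumption for illustration: the `inference_cost` helper, the token count per attempt, the price, and the per-attempt latency. The `concurrent` flag assumes enough hardware to run all attempts at once, which applies to parallel sampling but not to sequential revision.

```python
from typing import NamedTuple


class Cost(NamedTuple):
    total_tokens: int
    usd: float
    latency_seconds: float


def inference_cost(attempts: int, tokens_per_attempt: int,
                   usd_per_million_tokens: float, seconds_per_attempt: float,
                   concurrent: bool = False) -> Cost:
    """Token, dollar, and wall-clock cost of `attempts` generations."""
    total_tokens = attempts * tokens_per_attempt
    usd = total_tokens / 1_000_000 * usd_per_million_tokens
    # Sequential attempts add up; concurrent ones take about one attempt's time.
    latency = seconds_per_attempt if concurrent else attempts * seconds_per_attempt
    return Cost(total_tokens, usd, latency)


single = inference_cost(1, 1_000, 2.0, 3.0)
fifty = inference_cost(50, 1_000, 2.0, 3.0)
print(fifty.total_tokens // single.total_tokens)  # 50
print(fifty.latency_seconds)                      # 150.0
```

Running the attempts concurrently removes the latency penalty but not the token bill, which is why cost is listed separately from latency.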

You’ve now understood the foundation. Paper 24 builds on this to show the remarkable power of combining test-time compute with automated data generation. From there, the field moves into the reasoning-model era that defines 2025.
