Paper 23

Further Reading: Test-Time Compute and Beyond

The Original Paper

“Scaling LLM Test-Time Compute Optimally Can be More Effective than Scaling Model Parameters”

  • Authors: Charlie Snell, Jaehoon Lee, Kelvin Xu, Aviral Kumar
  • Published: ICLR 2025 (arxiv 2408.03314)
  • Link: https://arxiv.org/abs/2408.03314
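  • What to notice: The empirical results on MATH, where a smaller model given extra test-time compute outperforms a 14x larger model in a FLOPs-matched comparison, and the framework for choosing a compute-optimal strategy based on prompt difficulty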

Essential Follow-Ups

Let’s Verify Step by Step (Lightman et al., 2023) (Paper 16)

  • arXiv:2305.20050
  • The foundation for Process Reward Models (PRMs)
  • Explains how to train a verifier that scores intermediate reasoning steps
  • Essential background for understanding how a verifier ranks candidates in Best-of-N selection

Self-Consistency Improves Chain of Thought Reasoning in Language Models (Wang et al., 2022)

  • arXiv:2203.11171
  • Introduces the idea of sampling multiple reasoning paths and using majority voting
  • A precursor to modern test-time compute strategies
  • Good intuition: if you sample 10 independent reasoning paths for the same question, do most of them arrive at the same final answer?
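
The voting step can be sketched in a few lines. This is a minimal illustration rather than the paper's code: it assumes the final answers have already been extracted from each sampled reasoning path, and the `sampled` list is made-up data.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common final answer and the fraction of samples that agree."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    # Light normalization so "42" and " 42" count as the same answer.
    normalized = [a.strip() for a in answers]
    winner, count = Counter(normalized).most_common(1)[0]
    return winner, count / len(normalized)


# Hypothetical final answers extracted from 10 sampled reasoning paths.
sampled = ["42", "42", "41", "42", " 42", "36", "42", "42", "41", "42"]
answer, agreement = majority_vote(sampled)
print(answer, agreement)
```

The agreement fraction is a rough confidence signal: a low value suggests the question may deserve more samples or a stronger verifier.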

OpenAI o1 System Card (OpenAI, September 2024)

  • A prominent early commercial realisation of test-time compute principles
  • Describes o1 as trained with large-scale reinforcement learning to reason through a chain of thought before answering; further implementation details are not public
  • Shows the real-world impact: strong performance on math and coding tasks

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (DeepSeek-AI, January 2025)

  • arXiv:2501.12948
  • Evidence that o1-style reasoning can be reproduced with openly released weights
  • Demonstrates large-scale reinforcement learning for training reasoning models, including a variant (R1-Zero) trained without supervised fine-tuning as a first step
  • A good example of a model that spends more test-time compute by producing long chains of thought

“How OpenAI o1 Works” (Various technical blogs)

  • Many AI researchers have written accessible explanations of o1’s chain-of-thought mechanism
  • Search for “o1 reasoning model explanation” on Twitter/X or Substack for recent takes
  • These posts often connect o1 back to papers like this one

“Test-Time Compute: The New Frontier” (AI community discussions)

  • The paper sparked a wave of discourse about whether scaling at inference time is fundamentally different from training-time scaling
  • Look for threads discussing the implications for model efficiency and accessibility

Benchmarks and Datasets

MATH Dataset (Hendrycks et al., 2021)

  • 12,500 competition-level math problems (AMC, AIME)
  • arXiv:2103.03874
  • The primary evaluation benchmark for this paper
  • Problems range from high school to competition difficulty
  • Difficulty: Most require multi-step symbolic reasoning

GSM8K: Grade School Math 8K (Cobbe et al., 2021)

  • 8,500 grade school math word problems
  • arXiv:2110.14168
  • Simpler than MATH, good for testing basic reasoning
  • Useful for checking how much test-time compute helps at a lower difficulty level

AIME and AMC Competition Problems

  • American Invitational Mathematics Examination (AIME) and American Mathematics Competitions (AMC)
  • Official source: https://www.maa.org/
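  • These contests are among the sources of MATH’s hardest problems
  • Time limits are tight: the AIME allows 3 hours for 15 problems (about 12 minutes each), and the AMC 10/12 allows 75 minutes for 25 problems (3 minutes each)
  • Matching human competitors under comparable time budgets would be a major milestone for models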

Code and Implementations

Best-of-N Implementation

  • Straightforward to implement with any model API (OpenAI, Anthropic, HuggingFace)
  • Basic pattern:
    1. Call the model N times with sampling enabled (temperature above 0, or different random seeds) so the outputs differ
    2. Score each output with a verifier (PRM, code execution, or human judgment)
    3. Select the highest-scoring output
    4. Return it to the user
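
The four-step pattern above can be sketched as follows. This is a minimal sketch under stated assumptions, not any provider's API: `generate` and `score` are hypothetical callables that you would back with a real model call and a real verifier, and the toy stand-ins exist only to make the example runnable.

```python
import random
from typing import Callable, Tuple


def best_of_n(prompt: str,
              generate: Callable[[str], str],
              score: Callable[[str, str], float],
              n: int = 8) -> Tuple[str, float]:
    """Sample n candidates, score each one, and return the best with its score."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Step 1: draw n candidates (generate is assumed to sample stochastically).
    candidates = [generate(prompt) for _ in range(n)]
    # Step 2: score every candidate with the verifier.
    scored = [(score(prompt, c), c) for c in candidates]
    # Steps 3 and 4: pick the highest-scoring candidate and return it.
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


# Toy stand-ins (hypothetical): a "model" that guesses and a verifier that
# knows the right answer. Replace both with real components.
_rng = random.Random(0)


def toy_generate(prompt: str) -> str:
    return str(_rng.choice([40, 41, 42, 43]))


def toy_score(prompt: str, candidate: str) -> float:
    return 1.0 if candidate == "42" else 0.0


print(best_of_n("What is 6 * 7?", toy_generate, toy_score, n=8))
```

Keeping generation and scoring as separate callables makes it easy to swap a PRM for code execution without touching the selection logic.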

Verifier/PRM Training Code

  • Training a Process Reward Model requires:
    • A dataset of reasoning traces with step-level correctness annotations
    • A model architecture that can score individual steps
    • Supervision signal: which steps are correct/incorrect
  • The Let’s Verify paper (Paper 16) provides detailed guidance
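
To make the data requirement concrete, here is a hedged sketch of what a step-annotated trace might look like and how per-step scores could be collapsed into one score per solution. The `StepLabel` and `Trace` classes, the binary labels, and the three aggregation rules are illustrative assumptions, not the format used by any particular paper.

```python
from dataclasses import dataclass
from math import prod
from typing import List, Tuple


@dataclass
class StepLabel:
    text: str
    correct: bool  # step-level annotation (binary here for simplicity)


@dataclass
class Trace:
    problem: str
    steps: List[StepLabel]


def to_training_pairs(trace: Trace) -> List[Tuple[str, int]]:
    """Build (prefix, label) pairs: each prefix ends at one step, labeled 1 if that step is correct."""
    pairs = []
    prefix = trace.problem
    for step in trace.steps:
        prefix = prefix + "\n" + step.text
        pairs.append((prefix, 1 if step.correct else 0))
    return pairs


def aggregate(step_scores: List[float], how: str = "min") -> float:
    """Collapse per-step verifier scores into a single solution-level score."""
    if not step_scores:
        raise ValueError("need at least one step score")
    if how == "min":
        return min(step_scores)   # a solution is only as good as its weakest step
    if how == "prod":
        return prod(step_scores)  # treat scores as independent probabilities
    if how == "last":
        return step_scores[-1]    # trust the score at the final step
    raise ValueError(f"unknown aggregation: {how}")


example = Trace(
    problem="Solve 2x + 6 = 10.",
    steps=[
        StepLabel("Subtract 6 from both sides: 2x = 4.", True),
        StepLabel("Divide both sides by 2.", True),
        StepLabel("So x = 4.", False),
    ],
)
for prefix, label in to_training_pairs(example):
    print(label, repr(prefix.splitlines()[-1]))
```

The choice of aggregation rule matters when ranking candidates, so it is worth treating as a tunable design decision rather than a fixed detail.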

← Paper 22: Constitutional AI

  • Focuses on feedback and alignment
  • Sets the stage for understanding how to train verifiers (PRMs)

Paper 23: Test-Time Compute (you are here)

  • The framework for inference-time scaling

→ Paper 24: rStar-Math

  • Takes test-time compute to its logical conclusion
  • Shows that self-evolved training data (generated via MCTS) amplifies the benefits
  • The next step after understanding test-time compute: using it to improve the base model

Open Questions for Research

After reading this paper, consider:

  1. Can test-time compute help with open-ended tasks? (Writing, creative problem-solving, subjective evaluation.) The paper assumes deterministic correctness and a reliable verifier — what happens when these assumptions break?

  2. How does test-time compute interact with model size? If you have a 100B model, does adding test-time compute still help? Is there a saturation point?

  3. Can you train better models on test-time compute data? This is rStar-Math’s insight — use the high-quality reasoning traces generated by test-time search as training data for the next model iteration.

  4. What’s the optimal verifier architecture? PRMs are expensive to train. Can you get away with simpler, faster verifiers? Or do you need expensive step-level training?

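  5. Latency optimization: Best-of-N samples are independent, so they can be generated in parallel as one batch, whereas sequential revisions must wait for the previous attempt. How should a fixed compute budget be split between parallel sampling and sequential revision, and how does that split change with problem difficulty?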
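
Because Best-of-N samples do not depend on one another, they can be requested concurrently. The sketch below assumes generation is an I/O-bound call to a remote model, which is why threads suffice; `slow_generate` is a hypothetical stand-in that merely sleeps. For a locally hosted model, batching sequences on the GPU would be the analogous approach.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def sample_in_parallel(prompt: str,
                       generate: Callable[[str], str],
                       n: int = 8,
                       max_workers: int = 8) -> List[str]:
    """Issue n independent generation calls concurrently and collect the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(generate, prompt) for _ in range(n)]
        # result() blocks until each call finishes and re-raises any exception.
        return [f.result() for f in futures]


def slow_generate(prompt: str) -> str:
    """Hypothetical stand-in for a remote model call with noticeable latency."""
    time.sleep(0.2)
    return f"candidate answer for: {prompt}"


start = time.perf_counter()
samples = sample_in_parallel("What is 6 * 7?", slow_generate, n=8)
elapsed = time.perf_counter() - start
print(len(samples), f"{elapsed:.2f}s")
```

With enough workers, wall-clock time is set by the slowest single sample rather than the sum of all samples, although total compute cost is unchanged.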


Key Takeaways for Further Study

  • Test-time compute is complementary to training-time scale, not a replacement
  • Process Reward Models are critical — the quality of your verifier directly determines the quality of Best-of-N selection
  • Domain matters: Works best where correctness is checkable (math, code). Harder for open-ended tasks.
  • Latency is a real constraint: In production, you can’t always afford N=50 attempts
  • Self-evolution is powerful: Combine test-time compute search with supervised fine-tuning to create a feedback loop (rStar-Math)

Community and Discussion

  • OpenAI Research Forum: Discussions on test-time compute and scaling laws
  • Alignment Research Center (ARC): Work on process-based verification (related to PRM training)
  • Microsoft Research: Home of rStar-Math (Paper 24), extending these ideas
  • Anthropic: Constitutional AI and ongoing work on verifiers

Happy reading!