Further Reading: Test-Time Compute and Beyond

The Original Paper

“Scaling LLM Test-Time Compute Optimally Can be More Effective than Scaling Model Parameters”

Authors: Charlie Snell, Jaehoon Lee, Kelvin Xu, Aviral Kumar
Published: ICLR 2025 (arXiv:2408.03314)
Link: https://arxiv.org/abs/2408.03314
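What to notice: The empirical results on MATH, where a smaller model given extra test-time compute (TTC) outperforms a much larger one (the abstract reports a 14x larger model in a FLOPs-matched evaluation), and the framework for compute-optimal strategies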

Let’s Verify Step by Step (Lightman et al., 2023; Paper 16)

Self-Consistency Improves Chain of Thought Reasoning in Language Models (Wang et al., 2022)

arXiv:2203.11171
Introduces the idea of sampling multiple reasoning paths and using majority voting
A precursor to modern test-time compute strategies
Good intuition: if you sample 10 independent reasoning paths for the same question, do most of them arrive at the same answer?
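
The voting step can be sketched in a few lines. This is a minimal illustration, not the paper's code: it assumes the final answer has already been extracted from each sampled reasoning path as a string, and the list of sampled answers below is made up for the example.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common final answer and the fraction of samples that agree."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    # Normalise whitespace so " 42" and "42" count as the same answer.
    counts = Counter(a.strip() for a in answers)
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(answers)

# Hypothetical final answers extracted from 10 sampled reasoning paths.
sampled = ["42", "42", "41", "42", "7", "42", "42", "41", "42", "42"]
answer, share = majority_vote(sampled)
print(answer, share)
```

The agreement share is a rough confidence signal: a low share suggests the question is hard for the model or that answer extraction is inconsistent.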

OpenAI o1 System Card (OpenAI, September 2024)

The first commercial realisation of test-time compute principles
Describes how o1 uses chain-of-thought reasoning and internal search
Shows the real-world impact: competitive with specialist tools on math and coding

Deep Research — DeepSeek R1: Open-Source Reasoning Model Matching o1-Preview (DeepSeek, January 2025)

MATH Dataset (Hendrycks et al., 2021)

GSM8K: Grade School Math 8K (Cobbe et al., 2021)

AIME and AMC Competition Problems

American Invitational Mathematics Examination (AIME) and American Mathematics Competitions (AMC)
Official source: https://www.maa.org/
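Competitions like these are the source of MATH’s hardest problems
Time pressure is real: AIME gives contestants 3 hours for 15 problems (about 12 minutes each on average), and the hardest problems often take considerably longer
A model that matches human competitors within a comparable time budget would mark a major milestone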

Best-of-N Implementation

Straightforward to implement with any model API (OpenAI, Anthropic, HuggingFace)
Basic pattern:
1. Sample N outputs from the model for the same prompt (use a temperature above 0 so the samples differ, or vary the seed if the API exposes one)
2. Score each output with a verifier (a process reward model (PRM), code execution, or human judgment)
3. Select the highest-scoring output
4. Return it to the user
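
The four steps above can be sketched as follows. This is a minimal sketch under stated assumptions: `generate` and `score` are hypothetical callables standing in for a real model API call and a real verifier, and the toy versions below exist only so the example runs without any external service.

```python
import itertools
from typing import Callable, Tuple

def best_of_n(
    prompt: str,
    generate: Callable[[str], str],      # hypothetical: one sampled model output
    score: Callable[[str, str], float],  # hypothetical: verifier score, higher is better
    n: int = 8,
) -> Tuple[str, float]:
    """Sample n candidates, score each one, and return the best candidate with its score."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate(prompt) for _ in range(n)]       # step 1
    scored = [(score(prompt, c), c) for c in candidates]    # step 2
    # Step 3: on ties, max() keeps the first highest-scoring candidate.
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score                                 # step 4

# Toy stand-ins (assumptions for illustration only).
_fake_samples = itertools.cycle(["3", "5", "4"])

def toy_generate(prompt: str) -> str:
    return next(_fake_samples)  # pretends to be a stochastic model

def toy_score(prompt: str, candidate: str) -> float:
    return 1.0 if candidate == "4" else 0.0  # pretends to be a verifier

best, best_score = best_of_n("What is 2 + 2?", toy_generate, toy_score, n=4)
print(best, best_score)
```

Keeping generation and scoring as separate callables makes it easy to swap in a different verifier without touching the selection logic.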

Verifier/PRM Training Code

Training a Process Reward Model requires:
- A dataset of reasoning traces with step-level correctness annotations
- A model architecture that can score individual steps
- Supervision signal: which steps are correct/incorrect
The Let’s Verify paper (Paper 16) provides detailed guidance
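
To make the ingredients above concrete, here is one possible shape for step-annotated training data, plus a helper that collapses per-step scores into a single solution score. Everything here is an illustrative assumption: the class names, the example trace, and the three aggregation rules (product, minimum, last step) are common choices rather than a description of any specific paper's implementation.

```python
from dataclasses import dataclass
from math import prod
from typing import List

@dataclass
class LabeledStep:
    text: str
    is_correct: bool  # step-level supervision signal

@dataclass
class ReasoningTrace:
    problem: str
    steps: List[LabeledStep]

def solution_score(step_probs: List[float], how: str = "product") -> float:
    """Collapse per-step correctness probabilities into one score for the whole solution."""
    if not step_probs:
        raise ValueError("need at least one step score")
    if how == "product":
        return prod(step_probs)   # every step must be likely correct
    if how == "min":
        return min(step_probs)    # the weakest step decides
    if how == "last":
        return step_probs[-1]     # trust the score at the final step
    raise ValueError(f"unknown aggregation: {how}")

# Hypothetical annotated trace: the second step contains an error.
trace = ReasoningTrace(
    problem="Solve 2x + 6 = 10",
    steps=[
        LabeledStep("Subtract 6 from both sides: 2x = 4", True),
        LabeledStep("Divide by 2: x = 4", False),  # should be x = 2
    ],
)

# Per-step training targets a PRM would learn to predict.
targets = [float(step.is_correct) for step in trace.steps]

# Hypothetical PRM outputs for the two steps.
probs = [0.9, 0.5]
print(targets, solution_score(probs), solution_score(probs, "min"))
```

Which aggregation works best is an empirical question, so it is worth treating `how` as a hyperparameter when evaluating a verifier.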

After reading this paper, consider:

Can test-time compute help with open-ended tasks? (Writing, creative problem-solving, subjective evaluation.) The paper assumes deterministic correctness and a reliable verifier — what happens when these assumptions break?
How does test-time compute interact with model size? If you have a 100B model, does adding test-time compute still help? Is there a saturation point?
Can you train better models on test-time compute data? This is rStar-Math’s insight — use the high-quality reasoning traces generated by test-time search as training data for the next model iteration.
What’s the optimal verifier architecture? PRMs are expensive to train. Can you get away with simpler, faster verifiers? Or do you need expensive step-level training?
Latency optimization: Can you parallelize Best-of-N generation? The N samples are independent of one another, so in principle they can be batched on a GPU or requested concurrently, which trades extra throughput and cost for lower wall-clock latency.
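
On the last question, a small sketch of concurrent sampling may help. It assumes generation is an I/O-bound call to a remote API, which is where threads help; `slow_generate` is a hypothetical stand-in that just sleeps. For local GPU inference, batching the N sequences into one forward pass would be the analogous move and is not shown here.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

def sample_parallel(
    prompt: str,
    generate: Callable[[str], str],  # hypothetical: one blocking model call
    n: int,
    max_workers: int = 8,
) -> List[str]:
    """Issue n independent generation calls concurrently and collect the outputs."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in submission order.
        return list(pool.map(lambda _: generate(prompt), range(n)))

def slow_generate(prompt: str) -> str:
    time.sleep(0.05)  # simulates network and inference latency
    return f"answer to: {prompt}"

start = time.perf_counter()
outputs = sample_parallel("What is 2 + 2?", slow_generate, n=8)
elapsed = time.perf_counter() - start
print(len(outputs), f"{elapsed:.2f}s")
```

With eight workers the eight calls overlap, so wall-clock time should be close to a single call rather than eight; verifier scoring can be parallelized the same way.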

Key takeaways:

Test-time compute is complementary to training-time scale, not a replacement
Process Reward Models are critical — the quality of your verifier directly determines the quality of Best-of-N selection
Domain matters: Works best where correctness is checkable (math, code). Harder for open-ended tasks.
Latency is a real constraint: In production, you can’t always afford N=50 attempts
Self-evolution is powerful: Combine test-time compute search with supervised fine-tuning to create a feedback loop (rStar-Math)

Groups to follow:

OpenAI Research Forum: Discussions on test-time compute and scaling laws
Alignment Research Center (ARC): Work on process-based verification (related to PRM training)
Microsoft Research: Home of rStar-Math (Paper 24), extending these ideas
Anthropic: Constitutional AI and ongoing work on verifiers

Happy reading!