By 2023, the AI field had settled on a simple reading of scaling laws: bigger models are better models. Papers like “Scaling Laws for Neural Language Models” (Kaplan et al., 2020) showed that:
- Model loss decreases predictably with model size (more parameters = lower loss)
- Model loss decreases predictably with training data size (more tokens = lower loss)
- Model loss decreases predictably with training compute (more FLOPs = lower loss)
The three factors (parameters, data, compute) scale together predictably. Each relationship was fit as its own power law of the form loss ∝ X^(-α), where X is parameters (N), data (D), or compute (C), and the fitted exponents α are small, roughly 0.05 to 0.1.
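To get a feel for how small those exponents are, here is a minimal sketch of a single power-law term, a·N^(-α). The coefficient `a = 1.0` and the exponent `alpha = 0.076` are illustrative assumptions picked from the range quoted above, not values fitted here, and the function names are hypothetical.

```python
def power_law_loss(n_params: float, a: float = 1.0, alpha: float = 0.076) -> float:
    """Single power-law term: loss = a * N^(-alpha). Constants are illustrative."""
    return a * n_params ** (-alpha)


def loss_ratio(scale_factor: float, alpha: float = 0.076) -> float:
    """Factor by which the loss term is multiplied when N grows by scale_factor."""
    return scale_factor ** (-alpha)


if __name__ == "__main__":
    # Each row is a 10x jump in parameter count.
    for n in (1e9, 1e10, 1e11, 1e12):
        print(f"N = {n:.0e}: loss term = {power_law_loss(n):.4f}")
    print(f"10x parameters multiplies the term by {loss_ratio(10):.3f}")
```

With an exponent this small, a tenfold increase in parameters shrinks the term by only about 16%, and each further tenfold step removes a smaller absolute amount than the one before it.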
This led to a straightforward strategy: to improve your model, make it bigger and train on more data with more compute.
OpenAI’s scaling strategy: GPT-2 (1.5B) → GPT-3 (175B) → GPT-4 (~1.7T, rumored). Each step, bigger model, more data, more training compute.
Meta’s strategy: LLaMA (7B, 13B, 33B, 65B), then Llama 2 (7B, 13B, 70B). Pushing the parameter count.
Google’s strategy: PaLM (540B). DeepMind + Google: Gemini (competing on scale and task coverage).
The narrative was: “Scaling is all you need.” Just make the models bigger, and they’ll get smarter at everything.
By 2024, this narrative was still dominant, but a few cracks were showing:
- Diminishing returns: Scaling from 1B to 7B gave huge improvements. Scaling from 7B to 70B was good. Scaling from 70B to a rumored 1.7T gave measurable gains, but smaller ones per dollar spent.
- Inference cost is critical: Training a 70B model takes weeks, but that cost is paid once. Inference with a 70B model is expensive on every request, and the cost per inference grows with model size. For many applications, you can’t afford inference on the largest models.
- Some problems resist pure scaling: Competitive math problems, hard reasoning tasks, and complex multi-step problems still defeated even the largest models of the time (GPT-4, PaLM 2). You could scale up to 100B parameters and still get only about 40% accuracy on the MATH benchmark.
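The inference-cost point in the list above can be made concrete with a back-of-the-envelope estimate. The sketch below assumes the common rule of thumb that a dense transformer spends roughly 2 FLOPs per parameter per generated token; it ignores attention over long contexts, batching, and hardware efficiency, and the function names are hypothetical.

```python
def forward_flops_per_token(n_params: float) -> float:
    """Rule of thumb (an assumption): ~2 FLOPs per parameter per generated token."""
    return 2.0 * n_params


def generation_flops(n_params: float, n_tokens: int) -> float:
    """Approximate FLOPs needed to generate n_tokens with a dense model."""
    return forward_flops_per_token(n_params) * n_tokens


if __name__ == "__main__":
    for name, n in (("7B", 7e9), ("70B", 70e9)):
        flops = generation_flops(n, 500)
        print(f"{name}: {flops:.1e} FLOPs for a 500-token response")
```

Under this approximation, cost per response is linear in parameter count, so a 70B model costs about ten times as much per answer as a 7B model, on every single request.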
The question: If scaling model size isn’t working for hard problems, what else can we do?
This paper’s answer: Stop thinking about making bigger models. Start thinking about making models that spend more time thinking.
The key insight: test-time compute (compute spent at inference time) is a scaling axis separate from training-time compute.
This shifts the research agenda from “make bigger models” to “make models that can solve problems better by thinking longer.”
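One simple way to picture "thinking longer" is best-of-n sampling: draw several candidate answers and keep one that a verifier accepts. The sketch below is a toy model, not the paper's method. It assumes each sample is independently correct with probability `p_single` and that a perfect verifier exists, both of which are optimistic, so the numbers are an upper bound. All names are hypothetical.

```python
import random


def pass_at_n(p_single: float, n_samples: int) -> float:
    """Probability that at least one of n independent samples is correct."""
    return 1.0 - (1.0 - p_single) ** n_samples


def simulate_best_of_n(p_single: float, n_samples: int,
                       trials: int = 10_000, seed: int = 0) -> float:
    """Monte Carlo check of pass_at_n, assuming an oracle verifier."""
    rng = random.Random(seed)
    solved = 0
    for _ in range(trials):
        # A problem counts as solved if any one of the n samples is correct.
        if any(rng.random() < p_single for _ in range(n_samples)):
            solved += 1
    return solved / trials


if __name__ == "__main__":
    for n in (1, 4, 16):
        print(f"n = {n:2d}: analytic {pass_at_n(0.4, n):.4f}, "
              f"simulated {simulate_best_of_n(0.4, n):.4f}")
```

In this toy setting, a model that solves 40% of problems in one attempt solves about 87% with four attempts, without a single extra parameter. Real verifiers are imperfect and samples are correlated, so actual gains are smaller, but the example shows why inference-time compute behaves like its own axis.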