Section 02

The Problem: How to Allocate Compute?

Scaling Laws for Neural Language Models (2020)

If you have a fixed compute budget, how do you allocate it between:

  • Model size (N): How many parameters?
  • Dataset size (D): How many training tokens?

This is the central optimization problem.
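
Stated a little more formally, the problem can be sketched as a constrained minimization. Two things here are assumptions that this section does not itself define: a loss function L(N, D), and the common approximation that training costs about 6ND FLOPs.

```latex
\[
  (N^{*}, D^{*}) \;=\; \arg\min_{N,\,D} \; L(N, D)
  \quad \text{subject to} \quad 6\,N\,D \;\approx\; C
\]
```

Read this way, the budget C draws a curve of affordable (N, D) pairs, and the question is which point on that curve gives the lowest loss.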

The Trade-Off

All three scenarios below spend roughly the same budget, using the common approximation C ≈ 6ND for training FLOPs.

Scenario 1: Spend the compute on a huge model and a small dataset

Compute budget: C = 10^21 FLOPs
Allocation: N = 100B parameters, D ≈ 1.7B tokens

Problem: The model is vastly undertrained.
A 100B-parameter model sees far fewer tokens than it has parameters, so most of its capacity goes unused, and repeating the small dataset to compensate invites overfitting.
Performance: Poor.

Scenario 2: Spend the compute on a small model and a huge dataset

Compute budget: C = 10^21 FLOPs
Allocation: N = 100M parameters, D ≈ 1.7T tokens

Problem: The model is too small to capture the data.
A 100M-parameter model lacks the capacity to learn all the patterns in 1.7T tokens.
Performance: Poor.

Scenario 3: Balanced allocation

Compute budget: C = 10^21 FLOPs
Allocation: N = 3B parameters, D ≈ 56B tokens

Better balanced: neither the model nor the dataset is starved.
Performance: Good.
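
To make the trade-off concrete, here is a minimal sketch that assumes the common approximation C ≈ 6ND for training FLOPs (an assumption introduced here, not something this section states). Under it, choosing N fixes how many tokens a given budget can afford. The function name `tokens_for_budget` and the model sizes in the loop are hypothetical, chosen only for illustration.

```python
FLOPS_PER_PARAM_PER_TOKEN = 6  # assumption: C ≈ 6 * N * D (forward + backward pass)


def tokens_for_budget(compute_flops: float, n_params: float) -> float:
    """Tokens D affordable for a model of n_params under C ≈ 6 * N * D."""
    if compute_flops <= 0 or n_params <= 0:
        raise ValueError("compute_flops and n_params must be positive")
    return compute_flops / (FLOPS_PER_PARAM_PER_TOKEN * n_params)


if __name__ == "__main__":
    budget = 1e21  # FLOPs, the budget used in the scenarios
    for n_params in (1e8, 1e9, 1e10, 1e11):  # illustrative model sizes
        d_tokens = tokens_for_budget(budget, n_params)
        print(f"N = {n_params:.0e} params -> D ≈ {d_tokens:.2e} tokens "
              f"({d_tokens / n_params:.3g} tokens per parameter)")
```

Under this approximation, every tenfold increase in N costs a tenfold decrease in D, which is the trade-off in its simplest form.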

The question: What is the optimal ratio for a given compute budget?

Before this paper: There was no systematic answer. Researchers largely relied on guesswork and heuristics.

The Bottleneck Question

Which resource becomes the bottleneck first as you scale?

Belief 1 (Data bottleneck): “We’ll run out of high-quality text data before we run out of compute.”

  • If true: Focus on data efficiency, not model size.

Belief 2 (Compute bottleneck): “We can always collect more data (web crawling, books), but compute is limited.”

  • If true: Focus on model efficiency; use available compute wisely.

Belief 3 (They’re balanced): “Both scale together proportionally.”

  • If true: There’s an optimal ratio, and deviating from it hurts performance.

The paper had to answer: Which constraint actually binds?

Why This Matters Practically

Imagine you’re training a large language model. You have a budget of $10 million.

  • A 100B model costs about $5M to pre-train, which leaves about $5M for additional compute.
  • Do you:
    • (A) Train one 100B model on 200B tokens? (put the full $10M into one model)
    • (B) Train two 50B models on 300B tokens each? (split the budget across smaller models)
    • (C) Train one 200B model on 100B tokens? (bigger model, less data)

Without knowing the scaling laws, you’re guessing. With them, you calculate the expected loss for each option and pick the best.
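
As a rough illustration of "calculate the expected loss for each option", the sketch below uses the combined form L(N, D) that the paper proposes. The constants are approximately the fitted values from the paper's summary and should be treated as placeholders; the paper's N counts non-embedding parameters, and the helper name `predicted_loss` is hypothetical. The sketch compares loss per model only and ignores that option (B) trains two models.

```python
# Assumed constants: roughly the fitted values reported in the paper's summary.
# Treat them as placeholders; N counts non-embedding parameters there.
ALPHA_N = 0.076
ALPHA_D = 0.095
N_C = 8.8e13  # parameters
D_C = 5.4e13  # tokens


def predicted_loss(n_params: float, d_tokens: float) -> float:
    """L(N, D) = [(N_c / N)^(alpha_N / alpha_D) + D_c / D]^alpha_D."""
    model_term = (N_C / n_params) ** (ALPHA_N / ALPHA_D)
    data_term = D_C / d_tokens
    return (model_term + data_term) ** ALPHA_D


# The three options from the text, as (parameters, tokens per model).
options = {
    "A: one 100B model, 200B tokens": (100e9, 200e9),
    "B: two 50B models, 300B tokens each": (50e9, 300e9),
    "C: one 200B model, 100B tokens": (200e9, 100e9),
}

losses = {name: predicted_loss(n, d) for name, (n, d) in options.items()}
for name, loss in losses.items():
    print(f"{name}: predicted loss ≈ {loss:.3f}")
best = min(losses, key=losses.get)
print("Lowest predicted loss:", best)
```

The point is the workflow rather than the specific numbers: once a fitted L(N, D) is available, comparing allocations is a few lines of arithmetic instead of a set of full training runs.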

The Real Question

Is the relationship between (N, D) and loss smooth and predictable, or is it chaotic and task-dependent?

If smooth: You can extrapolate. “If we scale this parameter size to this data size, we’ll get this loss.”

If chaotic: You’d need to run experiments at every scale to know what works.
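
The "smooth" case can be sketched as follows: if loss follows a power law in model size, a straight-line fit in log-log space on small runs predicts the loss at a much larger scale. The data points below are synthetic, generated from a made-up power law purely for illustration; they are not results from the paper, and `fit_power_law` is a hypothetical helper.

```python
import math


def fit_power_law(sizes, losses):
    """Least-squares fit of log(loss) = log(a) - b * log(size); returns (a, b)."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic "small-scale runs": generated from a made-up power law, not real results.
TRUE_A, TRUE_B = 20.0, 0.08
small_sizes = [1e6, 3e6, 1e7, 3e7, 1e8]
small_losses = [TRUE_A * s ** -TRUE_B for s in small_sizes]

a, b = fit_power_law(small_sizes, small_losses)
target = 1e10  # a scale 100x beyond the largest synthetic run
extrapolated = a * target ** -b
print(f"fitted exponent b ≈ {b:.3f}; extrapolated loss at N = 1e10 ≈ {extrapolated:.3f}")
```

With noiseless synthetic data the fit recovers the generating exponent; with real training runs the fit would be noisy, and how far it can be trusted beyond the measured range is exactly what the "smooth or chaotic" question is asking.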

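The paper’s answer, that loss follows smooth power laws in model size, dataset size, and compute, reshaped how large training runs are planned.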


Key Takeaways from This Section

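  • The trade-off: At a fixed compute budget, parameters and training tokens compete. A larger model leaves fewer tokens to train on, and a smaller model leaves more; either extreme hurts.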
  • The optimization problem: Given a compute budget C, find the optimal (N, D) pair.
  • The uncertainty: Is there a fixed optimal ratio? Or does it depend on the task and data?
  • The practical stakes: Billions of dollars are allocated based on the answer.

Next: Section 03: The Idea