The Idea: Power Laws
The core insight is surprisingly simple: Loss follows a power law with respect to model size, data size, and compute.
What is a Power Law?
A power law is a relationship of the form:
y = a * x^b
where:
y = output (loss)
x = input (model size, or data size)
a = constant (determines the height of the curve)
b = exponent (determines the slope)
Example:
L(N) = 0.5 * N^(-0.076), with N measured in billions of parameters
If N = 1 (1B): L = 0.5 * 1^(-0.076) = 0.5
If N = 10 (10B): L = 0.5 * 10^(-0.076) ≈ 0.42
If N = 100 (100B): L = 0.5 * 100^(-0.076) ≈ 0.35
Notice: As N increases, L decreases. The bigger the model, the lower the loss (better).
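The example above can be checked with a few lines of Python. This is a minimal sketch, assuming N is measured in billions of parameters (so that N = 1 gives L = 0.5). The function name `power_law_loss` and the constants are illustrative, not taken from the paper.

```python
def power_law_loss(n_billion, a=0.5, b=-0.076):
    """Evaluate y = a * x^b, with x in billions of parameters (an assumption)."""
    return a * n_billion ** b

# Each 10x step in N multiplies the loss by the same factor, 10^(-0.076).
for n in (1, 10, 100):
    print(f"N = {n:>3}B -> L = {power_law_loss(n):.3f}")
```

Every tenfold increase in N multiplies the loss by the same constant factor (about 0.84 here), which is what makes the relationship a power law.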
Why Power Laws are Special
Power laws have a special property: they’re straight lines on log-log axes.
Normal axes: L plotted against N is a curve that drops quickly and then flattens.
Log-log axes: log(L) plotted against log(N) is a straight line with slope b.
Linear on log-log = power law on regular axes
This is useful because:
- Easy to fit: Plot your data on log-log axes, fit a line, extract the exponent.
- Easy to extrapolate: Extend the line to predict performance at larger scales.
- Universal pattern: Different domains (vision, speech, NLP) often show power laws.
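The fit-and-extrapolate procedure can be sketched in plain Python. This is a minimal illustration, not the paper's fitting code: it runs an ordinary least-squares line through (log x, log y) and reads off the exponent. The helper name `fit_power_law` and the data points are hypothetical; the points are generated from a known power law rather than measured.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares line through (log x, log y); returns (a, b) for y = a * x^b."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    # The slope of the log-log line is the power-law exponent b.
    sxx = sum((x - mean_x) ** 2 for x in lx)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    b = sxy / sxx
    # The intercept of the log-log line is log(a).
    a = math.exp(mean_y - b * mean_x)
    return a, b

# Synthetic points generated from y = 0.5 * x^(-0.076); not measured data.
sizes = [1e6, 1e7, 1e8, 1e9]
losses = [0.5 * s ** -0.076 for s in sizes]

a, b = fit_power_law(sizes, losses)
print(f"a = {a:.3f}, b = {b:.3f}")

# Extrapolate one decade beyond the fitted range by extending the line.
print(f"predicted loss at 1e10: {a * 1e10 ** b:.4f}")
```

With real, noisy measurements the recovered exponent is only an estimate, and extrapolation is only as trustworthy as the assumption that the straight line continues.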
The Paper’s Finding
The authors trained models across a wide range of scales:
Model sizes (N): from 768 to 1.5B non-embedding parameters
Dataset sizes (D): from 22M to 23B tokens
For each (N, D) pair, they measured cross-entropy loss. The results:
Loss vs. Model Size:
L(N) ∝ N^(-α_N) where α_N ≈ 0.076
Loss vs. Dataset Size:
L(D) ∝ D^(-α_D) where α_D ≈ 0.103
Loss vs. Compute:
L(C) ∝ C^(-α_C) where α_C ≈ 0.050 (when compute is allocated optimally)
The key discovery: These relationships are smooth and consistent, forming straight lines on log-log axes, with some trends spanning more than 7 orders of magnitude.
The Intuition
Why might this be true?
Pre-training as compression: Language models compress the data. A larger model can store more patterns. Adding parameters lets it compress more, reducing loss.
But with diminishing returns: Each additional parameter is less useful than the previous one. The curve keeps decreasing but flattens as it goes (a power law, not a straight-line decline).
Data similarly: More data means more patterns to learn from, so the model generalizes better. But again, each additional token is less informative than the previous one.
The power-law exponents (0.076 for N, 0.103 for D) quantify this diminishing return.
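To make the diminishing return concrete, here is a short sketch that converts each exponent into the loss reduction bought by a 10x increase. It assumes the other factor is not the bottleneck, and it uses the exponents quoted above; the function name `loss_multiplier` is illustrative.

```python
ALPHA_N = 0.076  # exponent for model size, as quoted above
ALPHA_D = 0.103  # exponent for dataset size, as quoted above

def loss_multiplier(scale_factor, alpha):
    """Factor by which loss changes when x grows by scale_factor, for L ∝ x^(-alpha)."""
    return scale_factor ** (-alpha)

for name, alpha in (("model size", ALPHA_N), ("dataset size", ALPHA_D)):
    m = loss_multiplier(10, alpha)
    print(f"10x {name}: loss x {m:.3f} ({(1 - m) * 100:.1f}% lower)")
```

A 10x larger model lowers the loss by roughly 16%, and the next 10x lowers it by another 16% of what remains, so the absolute gain shrinks at every step while the cost grows tenfold.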
Why This Was Surprising
Before this paper, the prevailing belief was that loss would level off—hit a plateau where adding more parameters gives almost no benefit.
Instead, the paper showed: Loss keeps decreasing smoothly. No plateau was visible across the range of model sizes tested.
This suggested: “We haven’t found the plateau yet. We should keep scaling.”
This finding helped justify GPT-3’s 175B parameters. If the laws hold, a 175B model should reach a predictably lower loss than any smaller model trained the same way.
The Compute-Optimal Frontier
If you have a fixed compute budget C, there’s an optimal way to split it between N and D.
Too many parameters, too little data:
- The model is too large relative to its training.
- Overfit risk.
- Loss doesn’t decrease as fast.
Too few parameters, too much data:
- The model is too small to capture the data.
- Underfit.
- Loss plateaus early.
Optimal balance: The paper found empirically that the optimal allocation is roughly:
N_opt ∝ C^0.73
D_opt ∝ C^0.27
Or equivalently:
N_opt ∝ D_opt^2.7
(This means model size grows about 2.7 times as fast as dataset size on a log scale: a 10x increase in data goes with a roughly 10^2.7 ≈ 500x increase in parameters.)
What this means: If you double your compute budget, you should:
- Increase model size by a factor of 2^0.73 ≈ 1.66x (66% bigger)
- Increase data size by a factor of 2^0.27 ≈ 1.21x (21% more)
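The same arithmetic works for any increase in compute. The sketch below applies the two exponents quoted above; the function name `scale_allocation` is illustrative, and the result is only as good as the assumption that the fitted exponents hold at the scale in question.

```python
N_EXPONENT = 0.73  # N_opt ∝ C^0.73, as quoted above
D_EXPONENT = 0.27  # D_opt ∝ C^0.27, as quoted above

def scale_allocation(compute_multiplier):
    """Return (model_factor, data_factor) for a given increase in compute."""
    model_factor = compute_multiplier ** N_EXPONENT
    data_factor = compute_multiplier ** D_EXPONENT
    return model_factor, data_factor

for k in (2, 10, 100):
    n_factor, d_factor = scale_allocation(k)
    print(f"{k:>3}x compute -> {n_factor:.2f}x parameters, {d_factor:.2f}x data")
```

Because the two exponents sum to 1, the model factor times the data factor equals the compute multiplier, consistent with compute growing in proportion to N times D.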
The Indian Analogy
Imagine a student studying for an exam:
- Study time (compute): Fixed at 20 hours.
- Number of subjects (model size): How many subjects to prepare?
- Number of problems (data): How many practice problems per subject?
If the student prepares all 20 subjects with no practice: knows the topics, but can’t solve problems. Loss is high.
If the student does 1,000 practice problems on just 2 subjects: becomes expert in 2 subjects, fails on 18 others. Loss is high.
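Optimal strategy: Prepare ~15 subjects and do ~100 practice problems per subject. This balances breadth and depth. The numbers are only illustrative: the 0.73 and 0.27 in the scaling law are exponents that describe how each quantity grows with compute, not shares of study time.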
The scaling laws are the study strategy that maximizes exam performance for a given study time.
Key Takeaways from This Section
- Power law: Loss ∝ N^(-0.076), Loss ∝ D^(-0.103)
- Straight line on log-log axes: Plot loss vs. size on log-log axes and the points fall on a line whose slope is the exponent.
- No plateau: Loss keeps decreasing at large scales; the curve is smooth.
- Optimal scaling: For a compute budget C, N_opt ∝ C^0.73 and D_opt ∝ C^0.27. These are exponents, not percentage shares of the budget.
- Surprising finding: The laws hold across 7 orders of magnitude, enabling extrapolation.
Next: Section 04: The Math