Worked Example: Verifying Scaling Laws with Real Numbers
Let’s trace through a simplified version of what the researchers did: train models at different scales and verify the power-law relationships.
Setup
We’ll train 4 language models with increasing size and data:
Experiment 1 (Small):
Model: 10 million parameters
Data: 1 billion tokens
Compute: 6 * 10^7 * 10^9 = 6 * 10^16 FLOPs
Experiment 2 (Medium):
Model: 100 million parameters
Data: 10 billion tokens
Compute: 6 * 10^8 * 10^10 = 6 * 10^18 FLOPs
Experiment 3 (Large):
Model: 1 billion parameters
Data: 100 billion tokens
Compute: 6 * 10^9 * 10^11 = 6 * 10^20 FLOPs
Experiment 4 (Very Large):
Model: 10 billion parameters
Data: 1 trillion tokens
Compute: 6 * 10^10 * 10^12 = 6 * 10^22 FLOPs
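The compute figures above all come from the same rule of thumb, C ≈ 6 * N * D. A minimal sketch that reproduces them is shown here; the function name `training_flops` and the `experiments` list are illustrative names introduced for this example, not part of any library.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute using the C ~ 6 * N * D rule from the setup."""
    return 6.0 * n_params * n_tokens

# (label, N parameters, D tokens) for the four experiments in this section.
experiments = [
    ("Small", 1e7, 1e9),
    ("Medium", 1e8, 1e10),
    ("Large", 1e9, 1e11),
    ("Very Large", 1e10, 1e12),
]

for label, n, d in experiments:
    print(f"{label:>10}: {training_flops(n, d):.0e} FLOPs")
```

Because N and D each grow by 10x per step, compute grows by 100x per step.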
Simulated Results
The researchers would train these models and measure cross-entropy loss. For this example, we’ll use the power law formulas to generate realistic loss values:
Using: L(N) = 0.5 * N^(-0.076)
Experiment 1:
L_1 = 0.5 * (10^7)^(-0.076)
= 0.5 * 10^(-7 * 0.076)
= 0.5 * 10^(-0.532)
= 0.5 * 0.293 ≈ 0.147 bits per token
Experiment 2:
L_2 = 0.5 * (10^8)^(-0.076)
= 0.5 * 10^(-0.608)
= 0.5 * 0.247 ≈ 0.123 bits per token
Experiment 3:
L_3 = 0.5 * (10^9)^(-0.076)
= 0.5 * 10^(-0.684)
= 0.5 * 0.207 ≈ 0.104 bits per token
Experiment 4:
L_4 = 0.5 * (10^10)^(-0.076)
= 0.5 * 10^(-0.76)
= 0.5 * 0.174 ≈ 0.087 bits per token
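The four hand calculations can be checked with a short script. This is a sketch that assumes the same illustrative formula, L(N) = 0.5 * N^(-0.076); `loss_from_params` is a name made up for this example.

```python
def loss_from_params(n_params: float, a: float = 0.5, alpha: float = 0.076) -> float:
    """Simulated loss from the power law L(N) = a * N^(-alpha)."""
    return a * n_params ** (-alpha)

param_counts = [1e7, 1e8, 1e9, 1e10]
losses = [loss_from_params(n) for n in param_counts]

for n, loss in zip(param_counts, losses):
    print(f"N = {n:.0e}  ->  L = {loss:.3f}")
```

Each 10x increase in N multiplies the loss by the same factor, 10^(-0.076) ≈ 0.84, which is what makes the relationship a straight line on log-log axes.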
Tabulated Results
| Exp | N (params) | D (tokens) | Compute (FLOPs) | Loss (simulated) |
|---|---|---|---|---|
| 1 | 10M | 1B | 6e16 | 0.147 |
| 2 | 100M | 10B | 6e18 | 0.123 |
| 3 | 1B | 100B | 6e20 | 0.104 |
| 4 | 10B | 1T | 6e22 | 0.087 |
Verifying the Power Law
Now, we’ll test whether these results fit the formula: L(N) = a * N^(-α)
Method: Log-Log Regression
Transform the data to log scale:
| log10(N) | Loss | log10(Loss) |
|---|---|---|
| 7 | 0.147 | -0.833 |
| 8 | 0.123 | -0.910 |
| 9 | 0.104 | -0.983 |
| 10 | 0.087 | -1.060 |
On a log-log plot, these should form a straight line:
log(Loss)
  ^
  | x
  |    x
  |       x
  |          x
  +------------> log(N)
Calculate the slope (the exponent α):
Using two points:
- Point 1: (7, -0.833)
- Point 4: (10, -1.060)
Slope = Δy / Δx = (-1.060 - (-0.833)) / (10 - 7)
= -0.227 / 3
= -0.0757 ≈ -0.076
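The slope is -0.076, which recovers the exponent α = 0.076 reported in the paper. That is expected here, because we generated the losses from that formula; the check confirms the fitting procedure rather than the law itself. The negative sign indicates that loss decreases as N increases.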
Extract the Coefficient a
Using the formula L = a * N^(-0.076) and point 1:
0.147 = a * (10^7)^(-0.076)
0.147 = a * 0.293
a = 0.147 / 0.293 ≈ 0.502 ≈ 0.5
So the fitted equation is: L(N) = 0.5 * N^(-0.076)
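The two-point slope and the single-point coefficient are fine for a hand calculation, but with real data you would normally fit all points at once. The sketch below does an ordinary least-squares fit in log10 space using only the standard library; `fit_power_law` is a hypothetical helper written for this example, and the inputs are the rounded losses from the table, so the fitted values will differ slightly from 0.5 and 0.076.

```python
import math

def fit_power_law(xs, ys):
    """Fit y = a * x^(-alpha) by least squares on (log10 x, log10 y).

    Returns (a, alpha).
    """
    log_x = [math.log10(x) for x in xs]
    log_y = [math.log10(y) for y in ys]
    mean_x = sum(log_x) / len(log_x)
    mean_y = sum(log_y) / len(log_y)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # The slope is -alpha and the intercept is log10(a).
    return 10 ** intercept, -slope

param_counts = [1e7, 1e8, 1e9, 1e10]
rounded_losses = [0.147, 0.123, 0.104, 0.087]  # three-decimal values from the table

a, alpha = fit_power_law(param_counts, rounded_losses)
print(f"a = {a:.3f}, alpha = {alpha:.4f}")
```

Using every point makes the estimate less sensitive to rounding or noise in any single measurement than the two-point method.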
Verification: Data Size Law
Similarly, using the same data with D (data size):
Experiment 1: D = 10^9, L = 0.147
Experiment 2: D = 10^10, L = 0.123
Experiment 3: D = 10^11, L = 0.104
Experiment 4: D = 10^12, L = 0.087
Log-log transformation:
log(D) log(Loss)
9 -0.833
10 -0.910
11 -0.983
12 -1.060
Slope:
Slope = (-1.060 - (-0.833)) / (12 - 9)
= -0.227 / 3
= -0.0757 ≈ -0.076
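This is the same exponent we found for N. That is an artifact of the setup: N and D grow by the same 10x factor at each step, so their effects cannot be separated, and the loss values were generated from the N law alone.
To measure the data exponent, hold N fixed and vary D. For example, pair the Experiment 1 model with 10x more data:
Experiment 1 (modified): N = 10M, D = 10B
Using both laws: L = max(0.5 * (10^7)^(-0.076), 0.6 * (10^10)^(-0.103))
= max(0.147, 0.6 * 10^(-1.03))
= max(0.147, 0.056) = 0.147
Here model size is still the bottleneck, so the extra data does not lower the loss. Only when D is the limiting factor does loss follow L(D) = 0.6 * D^(-0.103), and a log-log fit over D then gives a slope of -0.103.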
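A small sketch can make the fixed-N case concrete. It assumes the two illustrative laws from this section, L(N) = 0.5 * N^(-0.076) and L(D) = 0.6 * D^(-0.103), combined with the simplified "worse one wins" rule; all function names are made up for this example.

```python
import math

def loss_from_params(n: float, a: float = 0.5, alpha: float = 0.076) -> float:
    return a * n ** (-alpha)

def loss_from_data(d: float, b: float = 0.6, beta: float = 0.103) -> float:
    return b * d ** (-beta)

def combined_loss(n: float, d: float) -> float:
    """Simplified bottleneck rule: the larger (worse) of the two losses."""
    return max(loss_from_params(n), loss_from_data(d))

# N = 10M with D = 10B: which term dominates?
print(f"L_N = {loss_from_params(1e7):.3f}, L_D = {loss_from_data(1e10):.3f}")
print(f"combined = {combined_loss(1e7, 1e10):.3f}")

# Data law on its own (assumes N is large enough not to be the bottleneck).
token_counts = [1e9, 1e10, 1e11, 1e12]
data_losses = [loss_from_data(d) for d in token_counts]
data_slope = (math.log10(data_losses[-1]) - math.log10(data_losses[0])) / (
    math.log10(token_counts[-1]) - math.log10(token_counts[0])
)
print(f"log-log slope over D = {data_slope:.3f}")
```

With these toy coefficients the parameter term dominates at N = 10M, so the data exponent only becomes visible when the data law is isolated.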
Extrapolation
Now suppose we extrapolate from Experiment 3 (1B params, 100B tokens) to GPT-3’s scale.
Question: What loss do we expect for GPT-3 at 175B params and 300B tokens?
N_3 = 10^9 params, L_3 = 0.104
N_GPT3 = 1.75 * 10^11 params
Using L = 0.5 * N^(-0.076):
L_GPT3 = 0.5 * (1.75 * 10^11)^(-0.076)
= 0.5 * 10^(-11.243 * 0.076)
= 0.5 * 10^(-0.855)
= 0.5 * 0.139
≈ 0.070 bits per token
Compare to data scaling:
D_3 = 10^11 tokens, L_3 = 0.104
D_GPT3 = 3 * 10^11 tokens
Using L = 0.6 * D^(-0.103):
L_GPT3 = 0.6 * (3 * 10^11)^(-0.103)
= 0.6 * 10^(-11.477 * 0.103)
= 0.6 * 10^(-1.182)
= 0.6 * 0.066
≈ 0.040 bits per token
Combined: The loss is bottlenecked by whichever term is worse (larger). With 175B parameters and 300B tokens:
L_GPT3 ≈ max(0.070, 0.040) = 0.070
(This is approximate; the actual formula is more nuanced.)
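The extrapolation arithmetic can be reproduced in a few lines. This sketch assumes the same two illustrative laws and the simplified max rule used in this section; the variable and function names are hypothetical.

```python
def loss_from_params(n: float, a: float = 0.5, alpha: float = 0.076) -> float:
    return a * n ** (-alpha)

def loss_from_data(d: float, b: float = 0.6, beta: float = 0.103) -> float:
    return b * d ** (-beta)

gpt3_params = 1.75e11  # 175B parameters
gpt3_tokens = 3e11     # 300B tokens

l_n = loss_from_params(gpt3_params)
l_d = loss_from_data(gpt3_tokens)
l_combined = max(l_n, l_d)  # simplified bottleneck rule

print(f"parameter-limited loss: {l_n:.3f}")
print(f"data-limited loss:      {l_d:.3f}")
print(f"combined estimate:      {l_combined:.3f}")
```

Under these toy coefficients the parameter term is the larger of the two, so model size is the limiting factor at this scale.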
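Reality: GPT-3’s actual validation loss was around 1.73 bits per token (measured on a holdout set). Our simplified model gives ~0.070, which is more than an order of magnitude lower. The coefficients 0.5 and 0.6 in this example are illustrative rather than fitted to real data, so the absolute values are not meaningful; what carries over is the relative trend set by the exponents. Real losses also depend on data quality, compute efficiency, and other factors.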
Why This Verification Matters
- Confirms the power law: In real experiments, loss versus scale falls on a straight line on log-log axes.
- Enables extrapolation: Once you fit a line on log-log axes, you can predict large scales.
- Justifies GPT-3’s scale: The laws predict that 175B parameters would give meaningful improvements. No need for guessing.
- Guides future designs: “As the compute budget C grows, scale parameters roughly as C^0.73 and data as C^0.27.”
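To see what the 73/27 guidance implies in practice, the sketch below reads the two figures as exponents of compute (N ∝ C^0.73, D ∝ C^0.27). That reading is an assumption of this example, and `scale_factors` is a hypothetical helper.

```python
def scale_factors(compute_multiplier: float, n_exp: float = 0.73, d_exp: float = 0.27):
    """Return (parameter growth, data growth) for a given growth in compute.

    Assumes N scales as C^n_exp and D scales as C^d_exp.
    """
    return compute_multiplier ** n_exp, compute_multiplier ** d_exp

n_factor, d_factor = scale_factors(10.0)
print(f"10x compute -> {n_factor:.2f}x parameters, {d_factor:.2f}x tokens")
```

The exponents sum to 1, which is consistent with C ≈ 6 * N * D: the parameter and data growth factors multiply back to the compute multiplier.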
Sources of Variation
In real experiments, you’d see scatter around the trend line due to:
- Data quality: Higher-quality data tends to reach a lower loss at the same scale.
- Architecture details: Attention patterns and layer counts may matter slightly.
- Optimization: Learning rate, batch size, training duration affect final loss.
- Random seed: Different initializations give slightly different final losses.
The paper measured these variations and found them small relative to the scaling signal.
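To get a feel for how scatter affects the fit, the sketch below adds a small amount of synthetic noise to the simulated losses and refits the slope. The noise level (up to ±0.005 in log10 loss, roughly ±1% in loss) is a made-up value for illustration, not a figure from the paper.

```python
import math
import random

def fit_slope(log_x, log_y):
    """Ordinary least-squares slope of log_y against log_x."""
    mean_x = sum(log_x) / len(log_x)
    mean_y = sum(log_y) / len(log_y)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_x, log_y))
    sxx = sum((x - mean_x) ** 2 for x in log_x)
    return sxy / sxx

random.seed(0)  # fixed seed so the run is repeatable

log_n = [7.0, 8.0, 9.0, 10.0]
# Noise-free log10 losses from L(N) = 0.5 * N^(-0.076).
clean = [math.log10(0.5) - 0.076 * x for x in log_n]
# Hypothetical scatter added to each point.
noisy = [y + random.uniform(-0.005, 0.005) for y in clean]

clean_alpha = -fit_slope(log_n, clean)
noisy_alpha = -fit_slope(log_n, noisy)
print(f"alpha without noise: {clean_alpha:.4f}")
print(f"alpha with noise:    {noisy_alpha:.4f}")
```

With scatter this small spread over three orders of magnitude in N, the fitted exponent moves by at most about 0.004, which illustrates why modest run-to-run variation does not obscure the trend.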
Key Takeaways from This Section
- Power laws are empirical: Data fits L = a * N^(-α) very well.
- Log-log linearity: Plotting on log-log axes reveals straight lines; slope is the exponent.
- Extrapolation works: A curve fitted at small scales predicts the trend at larger scales, though absolute values depend on the fitted coefficient.
- Variations exist: But they’re small, so the laws are reliable for planning.
- Justifies scale: These curves show why 175B parameters was a reasoned choice, not a random guess.
Next: Section 06: The Code