Limitations of Scaling Laws
The scaling laws are powerful, but they have real limitations. Understanding them prevents overconfidence.
Limitation 1: The Exponents Are Approximate
The paper found α_N ≈ 0.076, but this is an average across experiments. Individual runs vary:
- Some models: α_N = 0.070
- Some models: α_N = 0.082
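When predicting at very large scales (1 trillion parameters), small errors in α compound. Because the predicted loss scales as N^(-α), a 0.006 error in the exponent shifts the prediction by about 1.4% for every order of magnitude of extrapolation, so a forecast six orders of magnitude beyond the fitted range can be off by roughly 9%.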
Impact: The laws are reliable for planning, but don’t trust a single predicted number. Use error bands (confidence intervals).
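To make the sensitivity concrete, here is a minimal sketch that extrapolates a pure power law from a single measured point using the three exponent values listed above. The anchor point (a 1B-parameter model at loss 2.5) and the function name are assumptions chosen for illustration, not values from the paper.

```python
def predicted_loss(n_params, alpha, n_ref, loss_ref):
    """Power-law extrapolation anchored at a measured point (n_ref, loss_ref)."""
    return loss_ref * (n_params / n_ref) ** (-alpha)

# Hypothetical anchor: a 1B-parameter model measured at loss 2.5.
N_REF, LOSS_REF = 1e9, 2.5
TARGET = 1e12  # extrapolate three orders of magnitude

# One prediction per candidate exponent gives a simple error band.
bands = {a: predicted_loss(TARGET, a, N_REF, LOSS_REF) for a in (0.070, 0.076, 0.082)}
for alpha, loss in bands.items():
    print(f"alpha_N = {alpha:.3f} -> predicted loss {loss:.3f}")

# Relative gap between the most pessimistic and most optimistic exponent.
spread = bands[0.070] / bands[0.082] - 1
print(f"spread between low and high exponent: {spread:.1%}")
```

With this anchor the gap between the low and high exponent is a little under 9%, and it widens with every additional order of magnitude. Reporting the whole band rather than the middle value is the point of the exercise.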
Limitation 2: The Laws Don’t Account for Data Quality
The formulas assume all tokens are equally useful:
L(D) = a * D^(-0.103)
But in reality, 1 trillion tokens of high-quality Wikipedia text ≠ 1 trillion tokens of garbage text.
Real-world example:
- 100B tokens of random internet text (low quality)
- 50B tokens of books + structured data (high quality)
A model trained on 50B high-quality tokens might outperform one trained on 100B low-quality tokens. The laws don’t capture this.
Impact: You can’t just collect any 1 trillion tokens. Data curation matters. The laws give a baseline for a fixed data distribution; better data can beat their predictions.
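One rough way to reason about the example above is to pretend each token carries a quality weight. The `quality` multiplier below is a hypothetical device, not something the paper defines, and `a=1.0` is a placeholder constant. The sketch only shows how much better curated tokens would need to be for the smaller corpus to win.

```python
ALPHA_D = 0.103  # data exponent from the formula above

def data_loss(tokens, quality=1.0, a=1.0):
    """L(D) = a * D_eff^(-ALPHA_D), with D_eff = quality * tokens.

    `quality` is a hypothetical multiplier; the paper's formula has no such term.
    """
    return a * (quality * tokens) ** (-ALPHA_D)

def breakeven_quality(tokens_small, tokens_large):
    """Quality multiplier at which the smaller corpus matches the larger one."""
    return tokens_large / tokens_small

low = data_loss(100e9, quality=1.0)   # 100B tokens of raw text
high = data_loss(50e9, quality=3.0)   # assumed: each curated token is worth 3 raw tokens

print("curated corpus wins:", high < low)
print("break-even quality multiplier:", breakeven_quality(50e9, 100e9))
```

Under this toy model the 50B-token corpus wins as soon as a curated token is worth more than two raw tokens. Real quality effects need not behave like a constant multiplier, so treat this as intuition rather than a prediction.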
Limitation 3: The Laws Don’t Account for Different Architectures
The paper focused on decoder-only Transformers. Do the same laws hold for:
- Encoder-only models (BERT)?
- Encoder-decoder models (T5)?
- Mixture-of-Experts models?
- Attention-free models (State Space Models, RNNs)?
Partial evidence: Scaling laws seem to generalize across Transformers, but the exponents (α_N, α_D) might differ slightly for different architectures.
Impact: If you switch to a different architecture, the predicted losses might be off. You’d need to re-calibrate on that architecture.
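Re-calibrating amounts to fitting a fresh power law to a handful of small runs on the new architecture. The sketch below does this with ordinary least squares in log-log space. The data points are synthetic stand-ins generated from an assumed exponent of 0.065 with small hand-picked perturbations; they are not measurements from any real model.

```python
import math

def fit_power_law(n_params, losses):
    """Fit L = k * N^(-alpha) by least squares on (log N, log L).

    Returns (alpha, k).
    """
    xs = [math.log(n) for n in n_params]
    ys = [math.log(l) for l in losses]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return -slope, math.exp(intercept)

# Synthetic "measurements" for a hypothetical new architecture.
sizes = [1e6, 1e7, 1e8, 1e9]
true_alpha, true_k = 0.065, 12.0
noise = [1.004, 0.997, 1.003, 0.996]  # hand-picked multiplicative perturbations
losses = [true_k * n ** (-true_alpha) * e for n, e in zip(sizes, noise)]

alpha_hat, k_hat = fit_power_law(sizes, losses)
print(f"fitted alpha = {alpha_hat:.4f} (generated with {true_alpha})")
```

The fitted exponent lands close to the one used to generate the data. With real runs, the residuals of this fit are also a quick check on whether a power law describes the new architecture at all.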
Limitation 4: The Laws Assume Dense Training
The compute formula C ≈ 6 * N * D assumes dense models (all parameters used in every forward pass). Modern models use:
- Mixture-of-Experts (MoE): Only a fraction of parameters are active per forward pass. Actual compute is lower.
- Sparse Attention: Not all tokens attend to all tokens. Compute is lower.
- Quantization: Use lower-precision numbers. The operation count is unchanged, but each operation costs less time and memory, so FLOPs stop tracking real cost.
For these architectures, C ≈ 6 * N * D doesn’t hold exactly.
Impact: If you use MoE or sparse architectures, the effective compute is lower than the formula suggests. You need architecture-specific constants.
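A common approximation for MoE models is to plug the parameters active per token into the same formula instead of the total count. The sketch below compares the two estimates. The model shape (1T total parameters, 8 of 64 experts routed per token, all parameters living in experts) is hypothetical and deliberately simplified.

```python
def training_flops(params_per_token, tokens):
    """C ~ 6 * N * D, where N counts the parameters actually used per token."""
    return 6 * params_per_token * tokens

# Hypothetical MoE configuration (simplified: every parameter sits in an expert).
total_params = 1e12
active_fraction = 8 / 64
tokens = 1e12

naive = training_flops(total_params, tokens)
effective = training_flops(total_params * active_fraction, tokens)

print(f"naive (total params):     {naive:.2e} FLOPs")
print(f"active-parameter estimate: {effective:.2e} FLOPs")
```

In a real MoE, attention and embedding parameters are shared across tokens and routing adds its own overhead, so the active fraction has to be measured for the specific architecture.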
Limitation 5: Compute-Optimal Ratio Was Later Refined (Chinchilla, 2022)
This paper found that the compute-optimal allocation scales as N ∝ C^0.73, D ∝ C^0.27.
But DeepMind’s Chinchilla paper (2022) re-examined this and found that parameters and data should grow at roughly the same rate:
- N ∝ C^0.50
- D ∝ C^0.50
Implication: Allocate more data relative to parameters than this paper suggested. By this standard, GPT-3 (175B parameters, roughly 300B training tokens) was substantially undertrained for its size.
Impact: This paper’s allocation recommendation was superseded. But the methodology (run experiments, fit power laws) remains valid.
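The practical difference between exponent pairs is easier to see as growth factors. The sketch below asks: if the compute budget grows 100x, how much should parameters and tokens each grow? The first pair is this paper's. The second (0.50, 0.50) is the equal-scaling case, which is approximately what the Chinchilla paper reports. The function name is illustrative.

```python
def growth_factors(compute_multiplier, n_exponent, d_exponent):
    """How much N and D grow when compute grows by `compute_multiplier`."""
    return compute_multiplier ** n_exponent, compute_multiplier ** d_exponent

for label, a, b in [("this paper", 0.73, 0.27), ("equal scaling", 0.50, 0.50)]:
    n_x, d_x = growth_factors(100, a, b)
    print(f"{label}: 100x compute -> {n_x:.1f}x parameters, {d_x:.1f}x tokens")
```

Since C ∝ N * D, the two exponents sum to 1 and the two growth factors always multiply back to the compute multiplier. The only question is how the budget is split.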
Limitation 6: Loss Doesn’t Directly Predict Benchmark Performance
The scaling laws predict loss (cross-entropy on the test set). But real task performance (accuracy on classification, BLEU on translation, correctness on math) doesn’t scale smoothly with loss.
Example (illustrative numbers):
Loss 1.5 bits per token → 60% accuracy on sentiment
Loss 1.4 bits per token → 65% accuracy
Loss 1.3 bits per token → 75% accuracy (phase transition!)
Loss 1.2 bits per token → 76% accuracy (back to gradual improvement)
Sometimes there are emergent abilities or phase transitions where performance jumps suddenly at a certain scale. The smooth loss curve doesn’t capture these.
Impact: Predicting loss is useful but incomplete. For specific tasks, you need task-specific benchmarks.
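A simple way to spot this kind of jump in your own evaluations is to look at the accuracy gained per step of loss improvement, rather than at loss alone. The sketch below does that for the four points in the example above. The variable names are illustrative.

```python
# (loss in bits per token, accuracy in percent), copied from the example above
points = [(1.5, 60), (1.4, 65), (1.3, 75), (1.2, 76)]

# Accuracy gained across each consecutive pair of checkpoints.
gains = [
    (prev_loss, next_loss, next_acc - prev_acc)
    for (prev_loss, prev_acc), (next_loss, next_acc) in zip(points, points[1:])
]
for prev_loss, next_loss, gain in gains:
    print(f"loss {prev_loss} -> {next_loss}: +{gain} accuracy points")

biggest = max(gains, key=lambda g: g[2])
print(f"largest jump: loss {biggest[0]} -> {biggest[1]}")
```

Each step improves loss by the same 0.1 bits, yet the accuracy gains are uneven. That unevenness is what a loss-only forecast cannot see.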
Limitation 7: The Laws May Break at Extreme Scales
The paper’s own experiments went up to about 1.5B parameters, and later models such as GPT-3 (175B) stayed close to the fitted trend. But what about 1 trillion? 10 trillion?
Unknown unknowns:
- Do the laws still hold?
- Do new phenomena (phase transitions, different optimal ratios) emerge?
- Does the dataset become the bottleneck (we run out of text)?
At extreme scales, the assumptions might break.
Impact: The laws are trustworthy up to ~200B parameters (verified empirically). Beyond that, extrapolation is riskier.
Limitation 8: The Laws Don’t Account for Inference Cost
The paper focuses on pre-training compute (C ≈ 6 * N * D). But once trained, the model must be run at inference (serving users).
Inference compute is very different:
- A 175B parameter model is expensive to run in production
- Smaller models with more data might be cheaper to serve
The laws don’t optimize for total cost (pre-training + inference). A compute-optimal training configuration might be sub-optimal for deployment.
Impact: In practice, you might want a smaller, faster model even if it trains less efficiently. The laws are pre-training-centric.
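The trade-off can be estimated with a lifetime cost that adds inference to training. The sketch below uses the standard approximation of about 2 * N FLOPs per token for a forward pass (the forward share of the 6 * N * D training estimate). The two configurations are hypothetical and are assumed, without evidence here, to reach similar quality.

```python
def lifetime_flops(n_params, train_tokens, served_tokens):
    """Training (~6*N*D) plus inference (~2*N FLOPs per served token)."""
    return 6 * n_params * train_tokens + 2 * n_params * served_tokens

def breakeven_served_tokens(big, small):
    """Served-token count beyond which the smaller model is cheaper overall.

    `big` and `small` are (n_params, train_tokens) pairs. A negative result
    means the smaller model is already cheaper before serving anything.
    """
    (n_big, d_big), (n_small, d_small) = big, small
    extra_training = 6 * (n_small * d_small - n_big * d_big)
    saving_per_token = 2 * (n_big - n_small)
    return extra_training / saving_per_token

big = (175e9, 300e9)    # hypothetical: larger model, fewer training tokens
small = (70e9, 1.4e12)  # hypothetical: smaller model, more training tokens

tokens_needed = breakeven_served_tokens(big, small)
print(f"break-even after {tokens_needed:.2e} served tokens")
```

For these numbers the smaller model costs more to train but pays that back after roughly 1.3 trillion served tokens. A heavily used deployment can pass that volume, which is why serving load belongs in the sizing decision.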
Key Takeaways from This Section
- Exponents are approximate: Use confidence bands, not point estimates.
- Data quality matters: The laws assume all tokens are equal; they’re not.
- Architecture-dependent: Different Transformer variants might have different exponents.
- Dense training assumed: MoE and sparse models don’t follow the C ≈ 6ND formula exactly.
- Chinchilla refined the ratio: Allocate more data than this paper suggested.
- Loss ≠ task performance: Loss is smooth, but benchmark performance can have phase transitions.
- May break at extremes: The laws are verified up to ~200B; extrapolation beyond is uncertain.
- Doesn’t optimize inference: Pre-training efficiency ≠ deployment efficiency.
These limitations don’t invalidate the scaling laws. They just clarify their scope and assumptions.