Section 02

The Problem: Inefficient Scaling and Closed Access

LLaMA: Open and Efficient Foundation Language Models 2023

LLaMA tackled two related problems:

Problem 1: Suboptimal Model Scaling

The old assumption: To get better language models, train bigger models.

  • GPT-2: 1.5B parameters
  • GPT-3: 175B parameters
  • PaLM: 540B parameters

The pattern: bigger = better. So keep scaling up.

The issue: From Chinchilla (Paper 13), we know this is suboptimal. When you have a fixed compute budget, you should:

  1. Train a smaller model
  2. On MORE data
  3. For longer

Concrete example: Compare GPT-3 with the smaller LLaMA models, estimating training compute with the common approximation C ≈ 6 × parameters × tokens:

Old approach (GPT-3 style):

  • Model: 175B parameters
  • Training tokens: ~300B (under 2 tokens per parameter)
  • Cost: ~3 × 10^23 FLOPs
  • Result: Strong, but data-starved

Chinchilla/LLaMA approach:

  • Model: 7-13B parameters (roughly 13-25x smaller)
  • Training tokens: 1.0 trillion for the 7B and 13B models (over 3x more); 1.4 trillion for the 33B and 65B models
  • Cost: ~4-8 × 10^22 FLOPs (a fraction of GPT-3's budget)
  • Result: Better performance with fewer parameters

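Why was GPT-3 suboptimal? Because 300B tokens is too little data for a 175B-parameter model: fewer than 2 tokens per parameter. The Chinchilla analysis found that, for a fixed compute budget, parameters and training tokens should be scaled up together, which works out to roughly 20 tokens per parameter. GPT-3's budget would therefore have been better spent on a smaller model trained on far more data. LLaMA pushes further in the same direction, training its small models on more tokens than the Chinchilla optimum, because a smaller model is also cheaper to run at inference time.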
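The compute comparison can be checked with a few lines of arithmetic. The sketch below assumes the widely used rule of thumb C ≈ 6·N·D (N parameters, D training tokens) and round parameter counts. The token counts are the ones reported for GPT-3 and for the 7B and 13B LLaMA models (1.0T tokens each), and the function names are illustrative only.

```python
def train_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute via the rule of thumb C ~= 6 * N * D."""
    return 6.0 * n_params * n_tokens


def tokens_per_param(n_params: float, n_tokens: float) -> float:
    """Ratio of training tokens to model parameters."""
    return n_tokens / n_params


# (parameters, training tokens); round numbers for illustration.
CONFIGS = {
    "GPT-3 175B": (175e9, 300e9),
    "LLaMA 7B": (7e9, 1.0e12),
    "LLaMA 13B": (13e9, 1.0e12),
}

if __name__ == "__main__":
    for name, (n, d) in CONFIGS.items():
        print(f"{name}: {train_flops(n, d):.2e} FLOPs, "
              f"{tokens_per_param(n, d):.1f} tokens/param")
```

Under these assumptions GPT-3 comes out at under 2 tokens per parameter, while the 13B model sees roughly 77 tokens per parameter on a smaller total compute budget.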

The implication: You do not need a 175B model to match GPT-3. The LLaMA paper reports that its 13B model, trained on 1.0T tokens, outperforms GPT-3 on most benchmarks.


Problem 2: Proprietary Models Block Research

Closed access: In late 2022, the state-of-the-art models were:

  • GPT-3 (OpenAI): Closed. Only accessible via API. Weights not public.
  • PaLM (Google): Closed. Weights not public.
  • Chinchilla (DeepMind): Closed. Weights not public.

What could researchers do?

  1. Use existing open models: Models like BLOOM (176B) had public weights but lagged behind the best closed models on benchmarks.
  2. Call the API: Use OpenAI’s API for GPT-3, but this was expensive (~$0.02 per 1K tokens) and rate-limited.
  3. Train from scratch: Train your own model, but this requires massive compute (thousands of GPUs, millions of dollars).

The problem for researchers:

  • A PhD student at a university in Pune cannot train a 175B model; their institution lacks the compute.
  • They cannot download or inspect GPT-3's weights; everything, including any fine-tuning, has to go through the hosted API.
  • They cannot study what GPT-3 learned or make architectural modifications.
  • Innovation is concentrated at a few labs with massive compute.

Barriers to progress:

  • Only researchers at OpenAI, Google, DeepMind, and a few other labs can push the frontier.
  • Thousands of talented researchers globally are locked out of working with frontier models.
  • The field becomes less diverse; ideas come from fewer institutions.

Architectural Inefficiencies

Beyond scaling, LLaMA addressed known architectural inefficiencies:

LayerNorm Is Slow and Unstable

Standard LayerNorm (used in GPT-3):

For input x:
  1. Compute mean: μ = (1/d) Σ x_i
  2. Compute variance: σ² = (1/d) Σ (x_i - μ)²
  3. Normalize: x̂ = (x - μ) / √(σ² + ε)
  4. Scale: y = γ * x̂ + β  (learnable γ, β)

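This requires:

  • Computing the mean (a reduction over all d features)
  • Computing the variance (a second reduction)
  • Division by the standard deviation
  • Storing two learnable parameter vectors (γ and β)

LLaMA replaces this with RMSNorm, which skips the mean subtraction and the bias β and rescales by the root mean square of x alone.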
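To make the cost difference concrete, here is a minimal sketch of LayerNorm, following the four steps listed above, next to RMSNorm, the simpler normalization that the LLaMA paper adopts. It works on plain Python lists for a single feature vector. The function names and the eps default are assumptions for illustration, not taken from any particular library.

```python
import math


def layer_norm(x, gamma, beta, eps=1e-5):
    """LayerNorm over one feature vector: mean, variance, normalize, scale."""
    d = len(x)
    mean = sum(x) / d                              # step 1: mean
    var = sum((v - mean) ** 2 for v in x) / d      # step 2: variance
    inv_std = 1.0 / math.sqrt(var + eps)           # step 3: normalize
    # step 4: learnable scale (gamma) and shift (beta)
    return [g * (v - mean) * inv_std + b for v, g, b in zip(x, gamma, beta)]


def rms_norm(x, gamma, eps=1e-5):
    """RMSNorm: no mean subtraction and no beta, only a root-mean-square rescale."""
    d = len(x)
    mean_sq = sum(v * v for v in x) / d            # single reduction
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [g * v * inv_rms for v, g in zip(x, gamma)]
```

RMSNorm needs one reduction instead of two and one parameter vector instead of two, which is where its modest speed advantage comes from.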

Pre-normalization vs. Post-normalization:

Post-norm (original Transformer style):

Self-Attention → Add → LayerNorm → FFN → Add → LayerNorm
                (post-norm after addition)

Pre-norm (better):

LayerNorm → Self-Attention → Add → LayerNorm → FFN → Add
(applied before, not after)

Pre-norm is more stable during training because each sublayer sees normalized inputs while the residual path stays an unmodified identity connection, which helps gradients flow through deep stacks.
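The two orderings differ only in where the normalization sits relative to the residual addition. The sketch below is a toy illustration on plain lists: attn and ffn are stand-in callables (hypothetical placeholders for the real sublayers), and norm is a parameter-free LayerNorm used only to show the ordering.

```python
import math


def norm(x, eps=1e-5):
    """Parameter-free LayerNorm, used here only to show where it is applied."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def add(a, b):
    """Elementwise residual addition."""
    return [p + q for p, q in zip(a, b)]


def post_norm_block(x, attn, ffn):
    """Sublayer -> Add -> Norm, twice: the residual stream itself is normalized."""
    x = norm(add(x, attn(x)))
    x = norm(add(x, ffn(x)))
    return x


def pre_norm_block(x, attn, ffn):
    """Norm -> Sublayer -> Add, twice: the residual stream is left untouched."""
    x = add(x, attn(norm(x)))
    x = add(x, ffn(norm(x)))
    return x
```

With sublayers that output all zeros, pre_norm_block returns its input unchanged, while post_norm_block still rescales it. That untouched residual path is what the stability argument relies on.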

SwiGLU vs. ReLU

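In the feedforward network, the original Transformer used ReLU (GPT-3 used the closely related GELU):

FFN(x) = ReLU(x W1 + b1) W2 + b2

LLaMA uses SwiGLU (Swish-Gated Linear Unit):

FFN(x) = (Swish(x W) ⊙ (x V)) W2
where Swish(x) = x * sigmoid(x)

SwiGLU has been shown to improve model quality slightly while using similar compute. Because it has three weight matrices instead of two, LLaMA shrinks the hidden dimension to 2/3 · 4d (instead of 4d) to keep the parameter count comparable.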
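A minimal sketch of a SwiGLU feedforward layer, using the bias-free form (Swish(xW) ⊙ xV) W2 with a final output projection W2. It uses plain nested lists rather than a tensor library. The helper names (vecmat, swiglu_ffn) and the matrix names are illustrative assumptions.

```python
import math


def swish(z):
    """Swish / SiLU activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))


def vecmat(x, W):
    """Row vector x (length n) times matrix W (n rows, m columns)."""
    return [sum(x[i] * W[i][j] for i in range(len(x)))
            for j in range(len(W[0]))]


def swiglu_ffn(x, W, V, W2):
    """(Swish(x W) * (x V)) W2: a gated hidden layer, then an output projection."""
    gate = [swish(z) for z in vecmat(x, W)]        # gating branch
    value = vecmat(x, V)                           # linear branch
    hidden = [g * v for g, v in zip(gate, value)]  # elementwise product
    return vecmat(hidden, W2)                      # project back to model width
```

Here W and V map the model dimension to the hidden dimension and W2 maps back, so the layer carries three weight matrices rather than the two of a ReLU feedforward layer.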

Fixed Positional Embeddings vs. Learned Absolute vs. Rotary

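GPT-3 used learned absolute positional embeddings. LLaMA uses Rotary Positional Embeddings (RoPE), which encode position by rotating the query and key vectors by an angle proportional to the token's position. As a result, the attention score between two tokens depends only on their relative offset, and the model is not tied to a fixed-size table of position vectors. In practice, going well beyond the training length (e.g., from 2048 to 4096 tokens) usually still calls for additional techniques or fine-tuning.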
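The rotation idea can be sketched in a few lines. The version below rotates adjacent pairs of dimensions and uses a base of 10000 for the frequencies, following the RoFormer formulation. The pairing convention and the function names are assumptions here, and real implementations differ in how they lay out the pairs.

```python
import math


def rope(vec, pos, base=10000.0):
    """Rotate each adjacent pair (vec[2i], vec[2i+1]) by the angle pos * theta_i."""
    d = len(vec)
    assert d % 2 == 0, "RoPE needs an even number of dimensions"
    out = []
    for i in range(d // 2):
        theta = base ** (-2.0 * i / d)             # per-pair rotation frequency
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        a, b = vec[2 * i], vec[2 * i + 1]
        out.extend([a * c - b * s, a * s + b * c])  # 2D rotation of the pair
    return out


def dot(a, b):
    """Plain dot product, standing in for an attention score."""
    return sum(p * q for p, q in zip(a, b))
```

Rotating a query at position 5 and a key at position 3 gives the same dot product as positions 12 and 10, because only the offset of 2 matters. The rotation also leaves vector norms unchanged.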


Summary: Two Concrete Problems

  1. Training inefficiency: Models like GPT-3 are trained with suboptimal compute allocation. A smaller model on more data would perform better.

  2. Access bottleneck: State-of-the-art models are proprietary and closed. Most researchers cannot access frontier-level models, limiting innovation and diversity in the field.

LLaMA addresses both: it’s a smaller, more efficient model whose weights were released to the research community.