Section 03

The Idea: Chinchilla Scaling, Architectural Improvements, and Open Release

LLaMA: Open and Efficient Foundation Language Models 2023

LLaMA’s innovation is not a single breakthrough but a combination of four choices:

1. Apply Chinchilla-Optimal Scaling

The principle: Given a fixed compute budget, allocate it across model size (N parameters) and data (D tokens) optimally.

From Chinchilla (Paper 13), the optimal allocation is roughly:

$$\text{Compute} \propto N \cdot D$$

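where $N \propto \text{compute}^{a}$ and $D \propto \text{compute}^{b}$, with both exponents close to 0.5 in Chinchilla's fits (they must sum to roughly 1 for the product $N \cdot D$ to track compute). In other words, parameters and training tokens should be scaled up in roughly equal proportion.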

In practice: D ≈ 20N (training tokens should be about 20x the number of parameters).
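
To make the arithmetic concrete, here is a minimal sketch of the rule of thumb. It assumes the common approximation C ≈ 6·N·D for training FLOPs, which is not stated above, and the function names are illustrative only.

```python
import math

TOKENS_PER_PARAM = 20       # rule of thumb from the text: D ≈ 20N
FLOPS_PER_PARAM_TOKEN = 6   # assumption: training compute C ≈ 6 * N * D

def optimal_tokens(n_params: float) -> float:
    """Chinchilla-style token budget for a model with n_params parameters."""
    return TOKENS_PER_PARAM * n_params

def chinchilla_allocation(compute_flops: float) -> tuple[float, float]:
    """Split a FLOP budget into (parameters, tokens) under D = 20N and C = 6ND."""
    # C = 6 * N * (20 * N)  =>  N = sqrt(C / 120)
    n_params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    return n_params, optimal_tokens(n_params)

# Token budgets the rule suggests for the four LLaMA model sizes.
for size in (7e9, 13e9, 33e9, 65e9):
    print(f"{size / 1e9:.0f}B parameters -> ~{optimal_tokens(size) / 1e9:.0f}B tokens")
```

Under these assumptions a 65B model calls for about 1.3T tokens, while a 7B model calls for only about 140B.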

LLaMA’s application:

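LLaMA trained four models:

| Model | Parameters | Training tokens | Training cost (A100-80GB GPU-hours) |
|---|---|---|---|
| LLaMA-7B | 7B | 1.0T | ~82,000 |
| LLaMA-13B | 13B | 1.0T | ~135,000 |
| LLaMA-33B | 33B | 1.4T | ~530,000 |
| LLaMA-65B | 65B | 1.4T | ~1,022,000 |

The two smaller models were trained on 1.0 trillion tokens and the two larger ones on 1.4 trillion. Under D ≈ 20N, 1.4T tokens is roughly Chinchilla-optimal for a model of about 70B parameters, so the 65B model sits close to the Chinchilla point. The paper reports that the 65B run took about 21 days on 2,048 A100 GPUs.

Why this works: Chinchilla's rule minimizes training compute, but it says nothing about inference cost. LLaMA deliberately trains its smaller models far past the Chinchilla point (the 7B model sees about 140 tokens per parameter instead of 20), because a small model trained on more data is cheaper to serve at a given quality level. The paper notes that 7B performance was still improving at 1T tokens. The 65B model ends up near Chinchilla-optimal, while the 7B to 33B models trade extra training compute for cheaper inference.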


2. Architectural Innovations

LLaMA 1 made three key architectural changes (A to C below); a fourth (D) arrived with LLaMA 2:

A. Pre-Normalization with RMSNorm

Standard LayerNorm (post-norm):

x → Attention → + → LayerNorm → FFN → + → LayerNorm → output
                    (post)                (post)

LLaMA (pre-norm):

x → RMSNorm → Attention → + → RMSNorm → FFN → + → output
    (pre)                     (pre)

RMSNorm (Root Mean Square Normalization):

Instead of LayerNorm (which computes mean and variance), RMSNorm only computes the root mean square:

$$\text{RMSNorm}(x) = \frac{x}{\text{RMS}(x)} \cdot \gamma$$

where $\text{RMS}(x) = \sqrt{\frac{1}{d} \sum_i x_i^2}$ and $\gamma$ is a learnable scale.

Advantages:

  • Simpler (no mean computation, no bias term)
  • Faster (fewer operations per forward pass)
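  • More stable training (this comes mainly from the pre-norm placement, which keeps the residual path unnormalized, rather than from RMSNorm itself)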
  • Reduces memory usage slightly
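
The formula and the pre-norm wiring above can be sketched in a few lines of NumPy. This is an illustration, not LLaMA's actual code: the small `eps` term is an assumption added for numerical safety, and `pre_norm_residual` is a hypothetical helper name.

```python
import numpy as np

def rms_norm(x: np.ndarray, gamma: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Divide x by its root mean square over the last axis, then scale by gamma."""
    # eps is an assumption (not in the formula above) to avoid division by zero.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * gamma

def pre_norm_residual(x, sublayer, gamma):
    """Pre-norm step: normalize, apply the sublayer, then add the raw input back."""
    return x + sublayer(rms_norm(x, gamma))

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4, 8))   # (batch, sequence, model dimension)
gamma = np.ones(8)               # learnable scale, initialized to 1
y = rms_norm(x, gamma)

# With gamma = 1, every normalized vector has RMS very close to 1.
print(np.sqrt(np.mean(y * y, axis=-1)))
```

Note that no mean is subtracted and no bias is added, which is the whole difference from LayerNorm.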

B. SwiGLU Activation Function

Replace the standard ReLU in the feedforward network with SwiGLU:

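$$\text{FFN}_{\text{SwiGLU}}(x) = \left(\text{Swish}(x W_1) \otimes x W_2\right) W_3$$

where:

  • $\text{Swish}(x) = x \cdot \sigma(x)$ (smooth activation instead of hard ReLU)
  • $\otimes$ is element-wise multiplication (gating)
  • $W_1$ and $W_2$ project up to the hidden dimension and $W_3$ projects back down; LLaMA omits the bias terms and uses a hidden dimension of $\frac{2}{3} \cdot 4d$ instead of $4d$, which keeps the parameter count comparable to a standard two-matrix FFN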

Why SwiGLU?

  • Slightly better quality at matched parameter count (the GLU-variant experiments that introduced SwiGLU report small but consistent gains over ReLU and GELU feedforward layers)
  • Smooth activation (no dead units like ReLU)
  • The gating mechanism allows the network to selectively use or suppress information
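
As a rough sketch of the gating idea, the NumPy snippet below implements a SwiGLU feedforward layer. The weight names and toy dimensions are hypothetical; `w3` projects the gated hidden vector back to the model dimension, as in typical implementations, and bias terms are left out for brevity.

```python
import numpy as np

def swish(z: np.ndarray) -> np.ndarray:
    """Swish(z) = z * sigmoid(z)."""
    return z / (1.0 + np.exp(-z))

def swiglu_ffn(x, w1, w2, w3):
    """Gated feedforward: (Swish(x @ w1) * (x @ w2)) @ w3."""
    gate = swish(x @ w1)       # gate branch, decides how much passes through
    value = x @ w2             # value branch, plain linear projection
    return (gate * value) @ w3 # element-wise gating, then project back down

d_model, d_hidden = 8, 16      # toy sizes chosen for illustration
rng = np.random.default_rng(0)
w1 = rng.normal(size=(d_model, d_hidden))
w2 = rng.normal(size=(d_model, d_hidden))
w3 = rng.normal(size=(d_hidden, d_model))

x = rng.normal(size=(3, d_model))   # three token vectors
out = swiglu_ffn(x, w1, w2, w3)
print(out.shape)
```

Where the gate branch is close to zero, the corresponding value is suppressed; where it is large, the value passes through almost unchanged.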

C. Rotary Positional Embeddings (RoPE)

Instead of learning absolute position embeddings, encode position via rotation in the query/key vector space.

The idea: For a position $m$ in the sequence, apply a rotation matrix $R(\theta_m)$ to the query and key vectors:

$$q'_m = R(\theta_m) \cdot q_m$$
$$k'_n = R(\theta_n) \cdot k_n$$

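Because the rotation angle grows linearly with position and rotations compose, the rotated dot product is $(R(\theta_m) q_m)^{\top} (R(\theta_n) k_n) = q_m^{\top} R(\theta_n - \theta_m)\, k_n$. The attention score between positions m and n therefore depends on position only through the relative distance (m-n), not on the absolute positions.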

Advantages:

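  • Handles longer contexts more gracefully than learned absolute embeddings, although going well beyond the 2,048-token training length usually needs extra techniques such as position interpolation
  • More interpretable (encodes relative positions explicitly)
  • Simpler than learned embeddings (no learned position parameters; the rotation angles are fixed functions of position)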
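
The relative-position property can be checked numerically. The sketch below rotates consecutive pairs of dimensions and assumes the frequency schedule base^(-2i/d) with base 10000 from the RoPE formulation; the interleaved pairing is one of two common conventions, and the function name is illustrative.

```python
import numpy as np

def rope(x: np.ndarray, position: int, base: float = 10000.0) -> np.ndarray:
    """Rotate each consecutive pair of dimensions of x by position * base**(-2i/d)."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)   # one frequency per dimension pair
    angles = position * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin  # standard 2D rotation
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=16), rng.normal(size=16)

# Same offset (3) at very different absolute positions.
score_near = rope(q, 5) @ rope(k, 2)
score_far = rope(q, 105) @ rope(k, 102)
print(np.isclose(score_near, score_far))
```

The two scores agree because only the difference between the positions survives in the dot product.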

D. Grouped Query Attention (in later versions)

LLaMA 1 uses standard multi-head attention. LLaMA 2 introduced grouped-query attention (fewer key/value heads than query heads) in its larger models (34B and 70B), which shrinks the key/value cache and speeds up inference with little quality loss. It is a small change, but an important one for serving efficiency.
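
For readers who want to see the head-sharing idea in code, here is a minimal NumPy sketch. It is not LLaMA 2's implementation: the function name and tensor layout are assumptions, and causal masking is omitted to keep the example short.

```python
import numpy as np

def softmax(z: np.ndarray, axis: int = -1) -> np.ndarray:
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def grouped_query_attention(q, k, v):
    """q: (n_q_heads, seq, d_head); k, v: (n_kv_heads, seq, d_head)."""
    n_q_heads, n_kv_heads = q.shape[0], k.shape[0]
    assert n_q_heads % n_kv_heads == 0, "query heads must divide evenly into groups"
    group = n_q_heads // n_kv_heads
    # Each key/value head is shared by `group` consecutive query heads.
    k = np.repeat(k, group, axis=0)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

rng = np.random.default_rng(0)
q = rng.normal(size=(8, 5, 4))   # 8 query heads
k = rng.normal(size=(2, 5, 4))   # only 2 key/value heads need to be cached
v = rng.normal(size=(2, 5, 4))
out = grouped_query_attention(q, k, v)
print(out.shape)
```

With as many key/value heads as query heads this reduces to ordinary multi-head attention, and with a single key/value head it becomes multi-query attention.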


3. Training on Public Data Only

Key decision: Use only publicly available data.

  • Source: CommonCrawl, C4, GitHub, Wikipedia, Books, ArXiv, StackExchange (all publicly available)
  • Total: roughly 1.4 trillion tokens after tokenization, drawn from these diverse public sources
  • No proprietary data: Unlike GPT-3 (whose training sets were not released or fully documented), LLaMA trained entirely on public data

Why this matters:

  • Reproducible: anyone can download the same data sources
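  • Legally clearer: fewer licensing unknowns than undisclosed datasets, although publicly available does not always mean permissively licensed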
  • Transparent: the community can audit what the model learned from

4. Open-Source Release

The final innovation: Publish the weights.

Meta released the LLaMA weights to researchers on request under a noncommercial research license; LLaMA 2 later shipped under a license that permits commercial use. This allowed:

  • Researchers: Fine-tune the model for experiments
  • Developers: Build applications without API calls
  • Community: Understand, critique, and improve the model
  • Entrepreneurs: Build products and hosting services around open-weight models (Replicate, Together, etc.), particularly once LLaMA 2 permitted commercial use

Indian Analogy: The Multi-Pronged Strategy

Imagine a student trying to study for JEE exams:

  1. Better study method (Chinchilla scaling): Study smarter, not just longer. Allocate study time efficiently across all topics.

  2. Better study tools (architecture improvements): Use better pens, better notebooks, better lighting. Small improvements in tools add up.

  3. Public resources (open data): Study from freely available resources (books, YouTube) instead of expensive coaching centers.

  4. Share knowledge (open release): Publish your notes on GitHub. Help other students. Create a community.

Individually, none of these is revolutionary. Together, they create a student who’s more capable, more efficient, and more visible than before.


The Insight: Efficiency Over Size

The core insight is: Efficiency beats raw scale.

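GPT-3 threw massive compute at the problem: 175B parameters, but only 300B tokens (under 2 tokens per parameter, far below the roughly 20 that Chinchilla recommends). LLaMA spent a comparable or smaller budget on a smarter allocation: 7-65B parameters trained on 1.0-1.4T tokens. The result: better models that are more accessible. The paper reports LLaMA-13B outperforming GPT-3 on most benchmarks despite being more than 10x smaller.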
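
A quick back-of-the-envelope check of that comparison, assuming the common C ≈ 6·N·D estimate of training FLOPs (an approximation, not a figure from either paper):

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute with the C ≈ 6 * N * D rule of thumb."""
    return 6.0 * n_params * n_tokens

# Parameter and token counts as quoted in the paragraph above.
gpt3 = training_flops(175e9, 300e9)
llama_65b = training_flops(65e9, 1.4e12)

print(f"GPT-3:     {gpt3:.2e} FLOPs, {300e9 / 175e9:.1f} tokens per parameter")
print(f"LLaMA-65B: {llama_65b:.2e} FLOPs, {1.4e12 / 65e9:.1f} tokens per parameter")
```

Under this estimate the two budgets are within about a factor of two of each other, but LLaMA-65B sees more than ten times as many tokens per parameter.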

This reflected a shift in the field from “Bigger is always better” to “Smart allocation is better.” This principle now dominates LLM design.