Section 04

The Math: RMSNorm, SwiGLU, and RoPE

LLaMA: Open and Efficient Foundation Language Models 2023


1. RMSNorm (Root Mean Square Normalization)

Standard LayerNorm

For reference, standard LayerNorm computes:

$$\mu = \frac{1}{d} \sum_{i=1}^{d} x_i$$

$$\sigma^2 = \frac{1}{d} \sum_{i=1}^{d} (x_i - \mu)^2$$

$$\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$$

$$y_i = \gamma \hat{x}_i + \beta$$

Where $\gamma, \beta$ are learnable parameters.

RMSNorm

RMSNorm simplifies this by removing the mean subtraction:

$$\text{RMS}(x) = \sqrt{\frac{1}{d} \sum_{i=1}^{d} x_i^2}$$

$$\text{RMSNorm}(x) = \frac{x}{\text{RMS}(x) + \epsilon} \otimes \gamma$$

where $\otimes$ is element-wise multiplication, $\gamma$ is a learnable scale, and $\epsilon$ is a small constant for numerical stability.

Key difference: RMSNorm divides by the root mean square of the vector rather than by the standard deviation of the mean-centered vector. There is no mean subtraction and no $\beta$ bias parameter.
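The definition above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function name `rms_norm` and the default `eps` value are assumptions made for the example, and it follows the formula as written here (with $\epsilon$ added to the RMS). Many implementations instead add $\epsilon$ under the square root.

```python
import math

def rms_norm(x, gamma=None, eps=1e-6):
    """RMSNorm of a list of floats: x / (RMS(x) + eps), scaled element-wise by gamma."""
    d = len(x)
    if gamma is None:
        gamma = [1.0] * d  # assumption: identity scale when no learned gamma is given
    rms = math.sqrt(sum(v * v for v in x) / d)
    return [g * v / (rms + eps) for v, g in zip(x, gamma)]

y = rms_norm([2.0, -1.0, 3.0, 0.0], eps=0.0)
print([round(v, 3) for v in y])

# The output should itself have RMS equal to 1 (up to floating-point error).
print(math.sqrt(sum(v * v for v in y) / len(y)))
```

Note that there is no mean to compute and no bias to add, which is the whole simplification relative to LayerNorm.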

Numerical Example

Input vector: $x = [2, -1, 3, 0]$ (dimension d = 4)

Step 1: Compute sum of squares $$\sum x_i^2 = 2^2 + (-1)^2 + 3^2 + 0^2 = 4 + 1 + 9 + 0 = 14$$

Step 2: Compute RMS $$\text{RMS}(x) = \sqrt{\frac{14}{4}} = \sqrt{3.5} \approx 1.871$$

Step 3: Normalize (assuming $\gamma = 1$, $\epsilon = 0$) $$\text{RMSNorm}(x) = \frac{x}{\text{RMS}(x)} = \frac{[2, -1, 3, 0]}{1.871} = [1.069, -0.535, 1.604, 0.000]$$

Verification: Check that the RMS of the output is 1: $$\text{RMS}(\text{output}) = \sqrt{\frac{(1.069)^2 + (-0.535)^2 + (1.604)^2 + 0^2}{4}} \approx \sqrt{\frac{4.00}{4}} = \sqrt{1.00} = 1.0 \checkmark$$

Comparison: RMSNorm vs. LayerNorm

For the same input $x = [2, -1, 3, 0]$:

LayerNorm:

  • Mean: $\mu = (2 - 1 + 3 + 0) / 4 = 1.0$
  • Variance: $\sigma^2 = ((2-1)^2 + (-1-1)^2 + (3-1)^2 + (0-1)^2) / 4 = (1 + 4 + 4 + 1) / 4 = 2.5$
  • Std: $\sigma = \sqrt{2.5} = 1.581$
  • Output: $[(2-1)/1.581, (-1-1)/1.581, (3-1)/1.581, (0-1)/1.581] = [0.632, -1.265, 1.265, -0.632]$ (after scaling with $\gamma=1$)

RMSNorm:

  • RMS: $\sqrt{14/4} = 1.871$
  • Output: $[1.069, -0.535, 1.604, 0.000]$ (as computed above)

Both normalize, but LayerNorm centers around zero (output has mean ≈ 0), while RMSNorm does not. RMSNorm is simpler (no mean computation) and slightly faster.
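The side-by-side comparison can be reproduced with a short script. This is a sketch with $\gamma = 1$, $\beta = 0$, and $\epsilon = 0$; the helper names `layer_norm`, `rms_norm`, and `mean` are chosen for this example only.

```python
import math

def layer_norm(x, eps=0.0):
    """LayerNorm with gamma = 1 and beta = 0: center, then divide by the std."""
    d = len(x)
    mu = sum(x) / d
    var = sum((v - mu) ** 2 for v in x) / d
    return [(v - mu) / math.sqrt(var + eps) for v in x]

def rms_norm(x, eps=0.0):
    """RMSNorm with gamma = 1: divide by the root mean square, no centering."""
    d = len(x)
    rms = math.sqrt(sum(v * v for v in x) / d)
    return [v / (rms + eps) for v in x]

def mean(v):
    return sum(v) / len(v)

x = [2.0, -1.0, 3.0, 0.0]
ln = layer_norm(x)
rn = rms_norm(x)

print("LayerNorm:", [round(v, 3) for v in ln], "mean:", round(mean(ln), 3))
print("RMSNorm:  ", [round(v, 3) for v in rn], "mean:", round(mean(rn), 3))
```

The LayerNorm output has mean zero, while the RMSNorm output keeps a nonzero mean because the input was never centered.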


2. SwiGLU Activation Function

Standard FFN with ReLU

In the original Transformer, the feedforward network is:

$$\text{FFN}_{\text{ReLU}}(x) = \text{ReLU}(x W_1 + b_1) \cdot W_2 + b_2$$

where ReLU$(z) = \max(0, z)$.

SwiGLU FFN

LLaMA replaces the ReLU layer with a SwiGLU gated unit:

$$\text{FFN}_{\text{SwiGLU}}(x) = (\text{Swish}(x W_1 + b_1)) \otimes (x W_2 + b_2)$$

where:

  • $\text{Swish}(z) = z \cdot \sigma(z)$ (Swish activation)
  • $\sigma(z) = 1 / (1 + e^{-z})$ (sigmoid function)
  • $\otimes$ is element-wise multiplication

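The key difference is gating: the Swish-activated output of the first projection is multiplied element-wise by the output of a separate projection of the same input. In the full LLaMA feed-forward block, this gated vector is then passed through a further output projection, and the bias terms are omitted; they are kept here to parallel the ReLU formula.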
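For vectors, the gated unit above might look like the following sketch. The toy weights and the helper names (`linear`, `swiglu`) are hypothetical, and matrices are stored as nested lists so that the example needs only the standard library. It computes exactly the formula as written in this section.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def swish(z):
    return z * sigmoid(z)

def linear(x, W, b):
    """Compute x W + b for x of length n, W of shape n x m, and b of length m."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j]
            for j in range(len(b))]

def swiglu(x, W1, b1, W2, b2):
    """Swish(x W1 + b1) multiplied element-wise by (x W2 + b2)."""
    a = linear(x, W1, b1)   # branch that goes through Swish
    g = linear(x, W2, b2)   # separate linear branch
    return [swish(ai) * gi for ai, gi in zip(a, g)]

# Hypothetical toy weights: 2 inputs -> 3 hidden units.
x = [1.0, -2.0]
W1 = [[0.5, -1.0, 0.0], [0.25, 0.5, 1.0]]
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.5]]
b2 = [0.0, 0.0, 0.0]

h = swiglu(x, W1, b1, W2, b2)
print([round(v, 4) for v in h])
```

Both branches read the same input, so the multiplier applied to each hidden unit changes with the input rather than being a fixed weight.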

Numerical Example

Input: $x = 1.5$ (scalar, for simplicity)

Parameters: $W_1 = 2.0, b_1 = 0.5, W_2 = 3.0, b_2 = 0$

Step 1a: Compute pre-activation for first part $$z_1 = x W_1 + b_1 = 1.5 \cdot 2.0 + 0.5 = 3.5$$

Step 1b: Apply Swish $$\text{Swish}(z_1) = z_1 \cdot \sigma(z_1) = 3.5 \cdot \sigma(3.5)$$

where $\sigma(3.5) = 1 / (1 + e^{-3.5}) = 1 / (1 + 0.0302) = 0.9707$

$$\text{Swish}(3.5) = 3.5 \cdot 0.9707 = 3.397$$

Step 2: Compute gate $$z_2 = x W_2 + b_2 = 1.5 \cdot 3.0 + 0 = 4.5$$

Step 3: Multiply (gate) $$\text{FFN}_{\text{SwiGLU}}(1.5) = \text{Swish}(3.5) \otimes z_2 = 3.397 \cdot 4.5 = 15.29$$

For comparison, ReLU would give: $$\text{FFN}_{\text{ReLU}}(1.5) = \text{ReLU}(3.5) \cdot W_2 + b_2 = 3.5 \cdot 3.0 + 0 = 10.5$$

SwiGLU produces a higher value here (15.29 vs. 10.5), but not because of Swish itself: $\text{Swish}(3.5) \approx 3.397$ is slightly below $\text{ReLU}(3.5) = 3.5$. The difference comes from the gate, which multiplies by the input-dependent value $z_2 = 4.5$ instead of the fixed weight $W_2 = 3.0$.
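The scalar example can be checked with a few lines of Python. The variable names below are chosen for this sketch. Values are computed in full floating-point precision, so the last digit may differ slightly from hand-rounded figures.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x, W1, b1, W2, b2 = 1.5, 2.0, 0.5, 3.0, 0.0

z1 = x * W1 + b1                  # pre-activation of the Swish branch
swish_z1 = z1 * sigmoid(z1)       # Swish(z1) = z1 * sigmoid(z1)
z2 = x * W2 + b2                  # linear gate branch
ffn_swiglu = swish_z1 * z2        # gated output

ffn_relu = max(0.0, z1) * W2 + b2 # ReLU version, reusing the same parameters

print(f"sigmoid(z1) = {sigmoid(z1):.4f}")
print(f"Swish(z1)   = {swish_z1:.3f}")
print(f"SwiGLU      = {ffn_swiglu:.2f}")
print(f"ReLU FFN    = {ffn_relu:.2f}")
```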

Why SwiGLU?

Empirically, SwiGLU shows:

  • Slightly better performance on language benchmarks (~2-3% improvements)
  • Fewer dead units (ReLU outputs exactly 0, with zero gradient, for every negative input; Swish is smooth and still passes gradient for moderately negative inputs)
  • Better parameter efficiency (gating allows selective feature usage)
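To make the dead-unit point concrete, the sketch below compares the derivatives of ReLU and Swish at a few sample inputs. The function names are chosen for this example, and the Swish derivative uses the standard identity $\frac{d}{dz}\,[z\,\sigma(z)] = \sigma(z) + z\,\sigma(z)(1 - \sigma(z))$.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def relu_grad(z):
    """Derivative of max(0, z); taken as 0 at z = 0 for this sketch."""
    return 1.0 if z > 0 else 0.0

def swish_grad(z):
    """Derivative of z * sigmoid(z)."""
    s = sigmoid(z)
    return s + z * s * (1.0 - s)

for z in (-3.0, -1.0, -0.5, 2.0):
    print(f"z = {z:+.1f}   ReLU' = {relu_grad(z):.3f}   Swish' = {swish_grad(z):+.3f}")
```

At each sampled negative input the ReLU gradient is exactly zero, while the Swish gradient is small but nonzero, so a unit pushed into the negative range can still receive updates.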

3. Rotary Positional Embeddings (RoPE)

The Problem with Absolute Position Embeddings

Standard Transformers learn position embeddings $p_i$ for each position $i = 1, 2, \ldots, L$:

$$\text{input}_i = \text{embedding}(x_i) + p_i$$

Issues:

  • Only defined for positions up to training length L
  • Generalizes poorly to longer sequences (e.g., trained on 2048 tokens, cannot handle 4096)
  • Uses more parameters

Rotary Embeddings (RoPE)

Instead of adding position embeddings, rotate the query and key vectors by an angle proportional to position.

For position $m$, apply a 2D rotation:

$$\mathbf{R}(m, \theta) = \begin{bmatrix} \cos(m\theta) & -\sin(m\theta) \\ \sin(m\theta) & \cos(m\theta) \end{bmatrix}$$

Then: $$q'_m = \mathbf{R}(m, \theta) \cdot q_m$$ $$k'_n = \mathbf{R}(n, \theta) \cdot k_n$$

where $q_m, k_n$ are query and key vectors (in practice, applied to pairs of dimensions).
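The rotation can be sketched as follows for a vector of even length, rotating one pair of dimensions at a time. Everything named here is an assumption for illustration: `rotate_pair`, `rope`, and `pair_frequencies` are hypothetical helpers, consecutive dimensions are paired (implementations differ on the pairing), and the per-pair schedule $\theta_i = 10000^{-2i/d}$ is taken from the RoPE paper rather than from this text.

```python
import math

def rotate_pair(x, y, angle):
    """Apply the 2D rotation R(angle) to the pair (x, y)."""
    c, s = math.cos(angle), math.sin(angle)
    return c * x - s * y, s * x + c * y

def pair_frequencies(d, base=10000.0):
    """One base angle per dimension pair: theta_i = base^(-2i/d)."""
    return [base ** (-2.0 * i / d) for i in range(d // 2)]

def rope(vec, pos, thetas):
    """Rotate pairs (vec[0], vec[1]), (vec[2], vec[3]), ... by pos * thetas[i]."""
    out = []
    for i, theta in enumerate(thetas):
        out.extend(rotate_pair(vec[2 * i], vec[2 * i + 1], pos * theta))
    return out

# Single pair, position 1, theta = 0.1 rad per position.
q1 = rope([1.0, 0.5], pos=1, thetas=[0.1])
print([round(v, 3) for v in q1])

# Four dimensions: two pairs, each with its own frequency.
q4d = rope([1.0, 0.5, -0.3, 2.0], pos=5, thetas=pair_frequencies(4))
print([round(v, 3) for v in q4d])
```

Because each pair is only rotated, the length of the vector is unchanged; position information lives entirely in the angles.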

Numerical Example: 2D Rotation

Query vector at position m=1: $q_1 = [1.0, 0.5]$

Angle basis: $\theta = 0.1$ rad/position

Position 1 angle: $1 \cdot 0.1 = 0.1$ rad

Rotation matrix for position 1: $$\mathbf{R}(1, 0.1) = \begin{bmatrix} \cos(0.1) & -\sin(0.1) \\ \sin(0.1) & \cos(0.1) \end{bmatrix} = \begin{bmatrix} 0.995 & -0.0998 \\ 0.0998 & 0.995 \end{bmatrix}$$

Rotated query: $$q'_1 = \begin{bmatrix} 0.995 & -0.0998 \\ 0.0998 & 0.995 \end{bmatrix} \begin{bmatrix} 1.0 \\ 0.5 \end{bmatrix} = \begin{bmatrix} 0.995 - 0.0499 \\ 0.0998 + 0.4975 \end{bmatrix} = \begin{bmatrix} 0.945 \\ 0.597 \end{bmatrix}$$

Now, for key at position n=3:

Position 3 angle: $3 \cdot 0.1 = 0.3$ rad

$$\mathbf{R}(3, 0.1) = \begin{bmatrix} \cos(0.3) & -\sin(0.3) \\ \sin(0.3) & \cos(0.3) \end{bmatrix} = \begin{bmatrix} 0.955 & -0.296 \\ 0.296 & 0.955 \end{bmatrix}$$

If $k_3 = [1.0, 0.5]$: $$k'_3 = \begin{bmatrix} 0.955 & -0.296 \\ 0.296 & 0.955 \end{bmatrix} \begin{bmatrix} 1.0 \\ 0.5 \end{bmatrix} = \begin{bmatrix} 0.955 - 0.148 \\ 0.296 + 0.4775 \end{bmatrix} = \begin{bmatrix} 0.807 \\ 0.774 \end{bmatrix}$$

Attention between position 1 and 3: $$\text{score} = q'_1 \cdot k'_3 = 0.945 \cdot 0.807 + 0.597 \cdot 0.774 = 0.763 + 0.462 = 1.225$$

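The key insight: this score depends on the relative distance (3 - 1 = 2), not on the absolute positions. Rotations compose, so ${q'_m}^{\top} k'_n = q_m^{\top} \mathbf{R}(m, \theta)^{\top} \mathbf{R}(n, \theta)\, k_n = q_m^{\top} \mathbf{R}(n - m, \theta)\, k_n$. Any pair of positions with the same angle difference ($0.2$ rad here) therefore gives the same attention score for the same $q$ and $k$, regardless of starting position.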
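The relative-position property is easy to check numerically. In this sketch the same query and key vectors from the example are placed at several position pairs; `rotate`, `dot`, and `score` are helper names chosen for the illustration.

```python
import math

def rotate(v, angle):
    """Apply the 2D rotation R(angle) to a 2-element vector."""
    c, s = math.cos(angle), math.sin(angle)
    return [c * v[0] - s * v[1], s * v[0] + c * v[1]]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 0.5]
k = [1.0, 0.5]
theta = 0.1  # rad per position, as in the example

def score(m, n):
    """Attention score between q at position m and k at position n."""
    return dot(rotate(q, m * theta), rotate(k, n * theta))

# Same distance (2), different absolute positions: the scores match.
print(round(score(1, 3), 4), round(score(11, 13), 4), round(score(101, 103), 4))

# Different distance (3): the score changes.
print(round(score(1, 4), 4))
```

Shifting both positions by the same offset leaves the score unchanged up to floating-point error, while changing the distance between them does not.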

Generalization Property

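Because RoPE scores depend only on relative position (distance), and the rotation is defined for any position, a model trained on sequences of length 2048 can be run on 4096 or longer, though quality at distances never seen in training is not guaranteed: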

  • Training: sequence length 2048, max angle difference = 2048 × 0.1
  • Testing: sequence length 4096, max angle difference = 4096 × 0.1 (larger angle, but still interpretable as “relative position”)

With learned absolute embeddings, you have no way to represent positions beyond 2048.
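The contrast can be illustrated with a toy lookup table standing in for learned position embeddings. All names and values here are hypothetical (`L_TRAIN`, `pos_table`, and the placeholder embedding values), and a very short "training length" is used to keep the example small.

```python
import math

L_TRAIN = 8  # hypothetical training length

# Stand-in for a learned table: one 2-d embedding per trained position.
pos_table = [[0.01 * i, -0.01 * i] for i in range(L_TRAIN)]

def absolute_embedding(pos):
    """Look up a learned embedding; fails for positions outside the table."""
    return pos_table[pos]

def rope_rotate(v, pos, theta=0.1):
    """Rotate a 2-d vector by pos * theta; defined for any position."""
    c, s = math.cos(pos * theta), math.sin(pos * theta)
    return [c * v[0] - s * v[1], s * v[0] + c * v[1]]

try:
    absolute_embedding(10)
    table_has_position = True
except IndexError:
    table_has_position = False

rotated = rope_rotate([1.0, 0.5], pos=10)

print("learned table covers position 10:", table_has_position)
print("RoPE at position 10:", [round(v, 3) for v in rotated])
```

This only shows that the rotation is defined beyond the training length; how well a trained model behaves at those positions is a separate, empirical question.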


Summary: The Mathematical Improvements

| Component | Benefit |
| --- | --- |
| RMSNorm | Simpler, faster than LayerNorm; no mean subtraction; fewer parameters |
| SwiGLU | Smoother activation; gating mechanism; ~2-3% better performance |
| RoPE | Encodes only relative position; generalizes to longer sequences |

None is revolutionary alone, but together they make training more efficient and inference faster while maintaining or improving quality.