Section 05

Worked Example: Computing RMSNorm, SwiGLU, and RoPE

LLaMA: Open and Efficient Foundation Language Models 2023

Let’s trace through a complete example showing all three key operations.


Example: Processing a Single Token

Setup: We have a token embedding that flows through one transformer layer. We’ll trace:

  1. Pre-normalization with RMSNorm
  2. Self-attention (simplified)
  3. SwiGLU feedforward

Simplified scenario:

  • Embedding dimension: d = 4 (normally 4096, but 4 for manual computation)
  • Batch size: 1
  • Sequence length: 1 (single token)

Step 1: Input Embedding

Raw embedding: $x = [0.5, -1.2, 0.8, 0.3]$

This comes from embedding the token “LLaMA”.


Step 2: Pre-Normalization (RMSNorm)

Operation: Normalize the input before attention.
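For reference, the general form being applied can be written as follows. This is a sketch: $\gamma$ is the learned per-dimension gain (set to 1.0 in this example), $d$ is the embedding dimension, and $\epsilon$ is a small stability constant that implementations typically add. The $\epsilon$ term is an assumption here and is not part of the hand computation in this section.

$$\text{RMSNorm}(x)_i = \gamma_i \cdot \frac{x_i}{\sqrt{\frac{1}{d}\sum_{j=1}^{d} x_j^2 + \epsilon}}$$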

Computation:

Sum of squares: $$\sum x_i^2 = (0.5)^2 + (-1.2)^2 + (0.8)^2 + (0.3)^2 = 0.25 + 1.44 + 0.64 + 0.09 = 2.42$$

RMS: $$\text{RMS}(x) = \sqrt{\frac{2.42}{4}} = \sqrt{0.605} \approx 0.778$$

Normalized (assuming $\gamma = 1.0$): $$\text{RMSNorm}(x) = \frac{[0.5, -1.2, 0.8, 0.3]}{0.778} = [0.643, -1.543, 1.029, 0.386]$$

After RMSNorm: $x_{\text{norm}} = [0.643, -1.543, 1.029, 0.386]$
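The computation above can be sketched in a few lines of Python. This is a minimal illustration, not LLaMA's actual implementation: the function name `rms_norm` and the `eps` parameter are assumptions introduced here (`eps` defaults to 0 so the result matches the hand calculation).

```python
import math

def rms_norm(x, gamma=None, eps=0.0):
    """Divide x by its root-mean-square, then apply the per-dimension gain gamma."""
    if gamma is None:
        gamma = [1.0] * len(x)  # gamma = 1.0, as assumed in the text
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gamma, x)]

x = [0.5, -1.2, 0.8, 0.3]
x_norm = rms_norm(x)
print([round(v, 3) for v in x_norm])  # [0.643, -1.543, 1.029, 0.386]
```

With $\gamma = 1$, the output always has an RMS of exactly 1, whatever the scale of the input.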


Step 3: Self-Attention (Simplified)

In a real transformer, we compute Query, Key, and Value projections, apply attention, and get the output. For brevity, let’s say attention outputs:

$$\text{attention_output} = [0.6, -1.5, 1.0, 0.4]$$

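(In reality, this vector would come from dot-product attention applied to $x_{\text{norm}}$; the steps that follow work the same way regardless of how it was produced.) Note that the residual connection adds the attention output to the raw input $x$, not to $x_{\text{norm}}$.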

Residual connection: $$\text{after_attention} = x + \text{attention_output} = [0.5, -1.2, 0.8, 0.3] + [0.6, -1.5, 1.0, 0.4] = [1.1, -2.7, 1.8, 0.7]$$
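The pre-norm pattern (normalize, apply the sublayer, add back the raw input) can be sketched as below. The helper names `pre_norm_residual` and `stand_in_attention` are hypothetical; the stand-in simply returns the attention output assumed in the text rather than computing real attention.

```python
import math

def rms_norm(x):
    rms = math.sqrt(sum(v * v for v in x) / len(x))
    return [v / rms for v in x]

def pre_norm_residual(x, sublayer):
    """Pre-norm block: the sublayer sees the normalized input, the skip path adds raw x."""
    return [xi + si for xi, si in zip(x, sublayer(rms_norm(x)))]

def stand_in_attention(x_norm):
    # Hypothetical placeholder: returns the attention output assumed in the text
    return [0.6, -1.5, 1.0, 0.4]

x = [0.5, -1.2, 0.8, 0.3]
after_attention = pre_norm_residual(x, stand_in_attention)
print([round(v, 3) for v in after_attention])  # [1.1, -2.7, 1.8, 0.7]
```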


Step 4: Pre-Normalization Again (Before FFN)

Input to RMSNorm: $y = [1.1, -2.7, 1.8, 0.7]$

Sum of squares: $$\sum y_i^2 = (1.1)^2 + (-2.7)^2 + (1.8)^2 + (0.7)^2 = 1.21 + 7.29 + 3.24 + 0.49 = 12.23$$

RMS: $$\text{RMS}(y) = \sqrt{\frac{12.23}{4}} = \sqrt{3.0575} \approx 1.749$$

Normalized: $$\text{RMSNorm}(y) = \frac{[1.1, -2.7, 1.8, 0.7]}{1.749} = [0.629, -1.544, 1.030, 0.400]$$

After RMSNorm: $y_{\text{norm}} = [0.629, -1.544, 1.030, 0.400]$


Step 5: SwiGLU Feedforward

Operation: $\text{FFN}_{\text{SwiGLU}}(y_{\text{norm}}) = \text{Swish}(y_{\text{norm}} W_1 + b_1) \otimes (y_{\text{norm}} W_2 + b_2)$

Simplified parameters:

  • $W_1$: projects the input (4D) to the intermediate dimension (8D in this scenario)
  • $W_2$: projects the input (4D) to the same intermediate dimension

For manual computation, we shrink the intermediate dimension to 2, so each matrix is 4x2:

First projection (gate input): $$z_1 = y_{\text{norm}} \cdot W_1 + b_1$$

Using $W_1 = [0.5, -0.3; -0.3, 0.4; -0.1, 0.6; 0.3, -0.2]$ (4x2 matrix, rows separated by semicolons) and $b_1 = [0.1, -0.1]$:

For dimension 1: $$z_{1,1} = 0.629 \cdot 0.5 + (-1.544) \cdot (-0.3) + 1.030 \cdot (-0.1) + 0.400 \cdot 0.3$$ $$= 0.315 + 0.463 - 0.103 + 0.120 = 0.795$$ $$\text{After bias: } 0.795 + 0.1 = 0.895$$

For dimension 2: $$z_{1,2} = 0.629 \cdot (-0.3) + (-1.544) \cdot 0.4 + 1.030 \cdot 0.6 + 0.400 \cdot (-0.2)$$ $$= -0.189 - 0.618 + 0.618 - 0.080 = -0.269$$ $$\text{After bias: } -0.269 - 0.1 = -0.369$$

So $z_1 = [0.895, -0.369]$
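The projection is a row vector times a matrix plus a bias, which can be sketched as below. The helper name `project` is an assumption, and `W1` is written so that its first column holds the coefficients used in the dimension-1 expansion above (0.5, -0.3, -0.1, 0.3). Because the code does not round intermediate products, its results can differ from the hand-computed values by about 0.001.

```python
def project(v, W, b):
    """Row vector times matrix plus bias: out[j] = sum_i v[i] * W[i][j] + b[j]."""
    return [sum(vi * row[j] for vi, row in zip(v, W)) + b[j] for j in range(len(b))]

y_norm = [0.629, -1.544, 1.030, 0.400]
W1 = [[0.5, -0.3],
      [-0.3, 0.4],
      [-0.1, 0.6],
      [0.3, -0.2]]  # 4x2: one row per input dimension
b1 = [0.1, -0.1]

z1 = project(y_norm, W1, b1)
print(z1)  # close to the hand-computed [0.895, -0.369]
```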

Apply Swish activation: $$\text{Swish}(z) = z \cdot \sigma(z) = z \cdot \frac{1}{1 + e^{-z}}$$

For $z_{1,1} = 0.895$: $$\sigma(0.895) = \frac{1}{1 + e^{-0.895}} = \frac{1}{1 + 0.407} = 0.711$$ $$\text{Swish}(0.895) = 0.895 \cdot 0.711 = 0.636$$

For $z_{1,2} = -0.369$: $$\sigma(-0.369) = \frac{1}{1 + e^{0.369}} = \frac{1}{1 + 1.447} = 0.408$$ $$\text{Swish}(-0.369) = -0.369 \cdot 0.408 = -0.151$$

So $\text{Swish}(z_1) = [0.636, -0.151]$
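As a quick check, Swish can be sketched directly from its definition. The function name `swish` is chosen here for illustration. The code evaluates the exponential at full precision, so its results may differ from the hand-rounded values in the third decimal place.

```python
import math

def swish(z):
    """Swish(z) = z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

z1 = [0.895, -0.369]
activated = [swish(z) for z in z1]
print(activated)  # close to the hand-computed [0.636, -0.151]
```

Unlike ReLU, which would map the negative input to exactly 0, Swish lets a small negative value through.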

Second projection (the linear branch that the Swish output gates): $$z_2 = y_{\text{norm}} \cdot W_2 + b_2$$

Using a different weight matrix $W_2 = [0.4, 0.2; -0.1, 0.5; 0.3, -0.2; -0.2, 0.4]$ and $b_2 = [0, 0.05]$:

For dimension 1: $$z_{2,1} = 0.629 \cdot 0.4 + (-1.544) \cdot (-0.1) + 1.030 \cdot 0.3 + 0.400 \cdot (-0.2)$$ $$= 0.252 + 0.154 + 0.309 - 0.080 = 0.635$$ $$\text{After bias: } 0.635 + 0 = 0.635$$

For dimension 2: $$z_{2,2} = 0.629 \cdot 0.2 + (-1.544) \cdot 0.5 + 1.030 \cdot (-0.2) + 0.400 \cdot 0.4$$ $$= 0.126 - 0.772 - 0.206 + 0.160 = -0.692$$ $$\text{After bias: } -0.692 + 0.05 = -0.642$$

So $z_2 = [0.635, -0.642]$

Element-wise multiplication (gating): $$\text{FFN}_{\text{SwiGLU}} = \text{Swish}(z_1) \otimes z_2 = [0.636, -0.151] \otimes [0.635, -0.642]$$ $$= [0.636 \cdot 0.635, -0.151 \cdot (-0.642)] = [0.404, 0.097]$$

FFN output: $[0.404, 0.097]$ (in 2D for this example; normally 8D)
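Putting the pieces together, the gated product can be sketched end to end. The names `linear` and `swiglu` are assumptions for this illustration, and `W1` uses the column-1 coefficients that appear in the dimension-1 expansion of $z_1$ above. Since no intermediate value is rounded, the result agrees with the hand computation only to within about 0.001.

```python
import math

def swish(z):
    return z / (1.0 + math.exp(-z))

def linear(v, W, b):
    """out[j] = sum_i v[i] * W[i][j] + b[j]"""
    return [sum(vi * row[j] for vi, row in zip(v, W)) + b[j] for j in range(len(b))]

def swiglu(v, W1, b1, W2, b2):
    """Swish(v W1 + b1) multiplied element-wise by (v W2 + b2)."""
    gate = [swish(z) for z in linear(v, W1, b1)]
    value = linear(v, W2, b2)
    return [g * u for g, u in zip(gate, value)]

y_norm = [0.629, -1.544, 1.030, 0.400]
W1 = [[0.5, -0.3], [-0.3, 0.4], [-0.1, 0.6], [0.3, -0.2]]
b1 = [0.1, -0.1]
W2 = [[0.4, 0.2], [-0.1, 0.5], [0.3, -0.2], [-0.2, 0.4]]
b2 = [0.0, 0.05]

ffn_hidden = swiglu(y_norm, W1, b1, W2, b2)
print(ffn_hidden)  # close to the hand-computed [0.404, 0.097]
```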


Step 6: Residual Connection and Next Layer

Final output before next layer: $$\text{output} = y + \text{FFN}_{\text{SwiGLU}} = [1.1, -2.7, 1.8, 0.7] + [0.404, 0.097, \ldots]$$

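(In reality, a third weight matrix projects the gated intermediate vector back to dimension $d$, so the FFN output has the same dimension as the input and every component is added. LLaMA's feedforward layers also omit the bias terms that this example includes.)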
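To make the shapes concrete, here is a sketch of the missing final step: a down-projection back to 4 dimensions followed by the residual add. The matrix `W3` is entirely hypothetical (its values are made up for illustration and appear nowhere in the worked example), as is the helper name `matvec_row`.

```python
def matvec_row(v, W):
    """Row vector times matrix: out[j] = sum_i v[i] * W[i][j]."""
    return [sum(vi * row[j] for vi, row in zip(v, W)) for j in range(len(W[0]))]

y = [1.1, -2.7, 1.8, 0.7]   # input to the FFN block (before normalization)
hidden = [0.404, 0.097]     # gated intermediate vector from Step 5

# HYPOTHETICAL down-projection (2x4); values invented purely for illustration
W3 = [[1.0, 0.0, -1.0, 0.5],
      [0.0, 2.0, 1.0, -0.5]]

ffn_out = matvec_row(hidden, W3)              # back to 4 dimensions
output = [a + b for a, b in zip(y, ffn_out)]  # residual connection
print(len(ffn_out), len(output))  # 4 4
```

With the output back in 4 dimensions, it can serve as the input to the next layer.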


Step 7: RoPE Example (Attention Computation)

Now let’s see how RoPE affects the attention computation for a 2-token sequence.

Token 1 query (after attention head projection): $q_1 = [1.0, 0.5]$

Token 1 key: $k_1 = [0.8, 0.6]$

Token 2 query: $q_2 = [0.9, 0.7]$

Token 2 key: $k_2 = [0.7, 0.5]$

RoPE angle basis: $\theta = 0.1$ rad/position

Without RoPE (Absolute position embeddings)

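Add learned position embeddings (applied directly to the queries and keys here to keep the comparison simple; in practice they are added to the token embeddings at the input):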

  • $p_1 = [0.1, 0.05]$
  • $p_2 = [0.15, 0.08]$

Then: $$q'_1 = q_1 + p_1 = [1.1, 0.55]$$ $$k'_1 = k_1 + p_1 = [0.9, 0.65]$$ $$q'_2 = q_2 + p_2 = [1.05, 0.78]$$ $$k'_2 = k_2 + p_2 = [0.85, 0.58]$$

Attention score between $q_2$ and $k_1$: $$\text{score} = q'_2 \cdot k'_1 = 1.05 \cdot 0.9 + 0.78 \cdot 0.65 = 0.945 + 0.507 = 1.452$$
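The additive score can be reproduced, and split into its four terms, with a short sketch. The helper names and the term labels are assumptions for illustration. Expanding $(q_2 + p_2) \cdot (k_1 + p_1)$ shows that three of the four terms involve the absolute position vectors $p_1$ and $p_2$.

```python
def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def add(a, b):
    return [u + v for u, v in zip(a, b)]

q2, k1 = [0.9, 0.7], [0.8, 0.6]
p1, p2 = [0.1, 0.05], [0.15, 0.08]

score = dot(add(q2, p2), add(k1, p1))

# (q + p2) . (k + p1) = q.k + q.p1 + p2.k + p2.p1
terms = {
    "content-content": dot(q2, k1),
    "content-position": dot(q2, p1),
    "position-content": dot(p2, k1),
    "position-position": dot(p2, p1),
}
print(round(score, 3))                # 1.452
print(round(sum(terms.values()), 3))  # 1.452
```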

With RoPE

Apply rotation at each position:

Position 1, angle = 0.1: $$R(0.1) = \begin{bmatrix} \cos(0.1) & -\sin(0.1) \\ \sin(0.1) & \cos(0.1) \end{bmatrix} = \begin{bmatrix} 0.995 & -0.0998 \\ 0.0998 & 0.995 \end{bmatrix}$$

$$q'_1 = R(0.1) \cdot q_1 = \begin{bmatrix} 0.995 & -0.0998 \\ 0.0998 & 0.995 \end{bmatrix} \begin{bmatrix} 1.0 \\ 0.5 \end{bmatrix} = \begin{bmatrix} 0.995 - 0.0499 \\ 0.0998 + 0.4975 \end{bmatrix} = \begin{bmatrix} 0.945 \\ 0.597 \end{bmatrix}$$

$$k'_1 = R(0.1) \cdot k_1 = \begin{bmatrix} 0.995 & -0.0998 \\ 0.0998 & 0.995 \end{bmatrix} \begin{bmatrix} 0.8 \\ 0.6 \end{bmatrix} = \begin{bmatrix} 0.796 - 0.0599 \\ 0.0798 + 0.597 \end{bmatrix} = \begin{bmatrix} 0.736 \\ 0.677 \end{bmatrix}$$

Position 2, angle = 0.2: $$R(0.2) = \begin{bmatrix} \cos(0.2) & -\sin(0.2) \\ \sin(0.2) & \cos(0.2) \end{bmatrix} = \begin{bmatrix} 0.980 & -0.199 \\ 0.199 & 0.980 \end{bmatrix}$$

$$q'_2 = R(0.2) \cdot q_2 = \begin{bmatrix} 0.980 & -0.199 \\ 0.199 & 0.980 \end{bmatrix} \begin{bmatrix} 0.9 \\ 0.7 \end{bmatrix} = \begin{bmatrix} 0.882 - 0.1393 \\ 0.1791 + 0.686 \end{bmatrix} = \begin{bmatrix} 0.743 \\ 0.865 \end{bmatrix}$$

$$k'_2 = R(0.2) \cdot k_2 = \begin{bmatrix} 0.980 & -0.199 \\ 0.199 & 0.980 \end{bmatrix} \begin{bmatrix} 0.7 \\ 0.5 \end{bmatrix} = \begin{bmatrix} 0.686 - 0.0995 \\ 0.1393 + 0.490 \end{bmatrix} = \begin{bmatrix} 0.586 \\ 0.629 \end{bmatrix}$$

Attention score between $q'_2$ and $k'_1$ (with RoPE): $$\text{score}_{\text{RoPE}} = q'_2 \cdot k'_1 = 0.743 \cdot 0.736 + 0.865 \cdot 0.677 = 0.547 + 0.586 = 1.133$$
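The rotations above can be sketched for this 2D case as follows. The helper name `rotate` is an assumption, and the code uses full-precision sines and cosines, so the score comes out near 1.132 rather than the hand-rounded 1.133.

```python
import math

def rotate(v, angle):
    """Apply the 2D rotation matrix R(angle) to the vector v."""
    c, s = math.cos(angle), math.sin(angle)
    return [c * v[0] - s * v[1], s * v[0] + c * v[1]]

theta = 0.1  # radians per position, as in the text

q2_rot = rotate([0.9, 0.7], 2 * theta)  # query at position 2
k1_rot = rotate([0.8, 0.6], 1 * theta)  # key at position 1

score_rope = q2_rot[0] * k1_rot[0] + q2_rot[1] * k1_rot[1]
print(score_rope)  # close to the hand-computed 1.133
```

A full RoPE implementation applies a rotation like this to each consecutive pair of dimensions in a head, using a different frequency per pair. A single pair with a single frequency is shown here.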

Comparison:

  • Without RoPE: score = 1.452
  • With RoPE: score = 1.133

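The two magnitudes are not directly comparable; what matters is what each score depends on. With RoPE, the positions enter only through their difference: because rotations compose, $q'_2 \cdot k'_1 = q_2^\top R(0.2)^\top R(0.1)\, k_1 = q_2^\top R(-0.1)\, k_1$, so the same $q$ and $k$ one position apart score about 1.133 wherever they sit in the sequence. With additive absolute embeddings, the score changes whenever the absolute positions change, even if the distance between the tokens stays the same. This relative-position property is the usual motivation for expecting RoPE to handle longer sequences better than learned absolute embeddings.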
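The relative-position behavior can be checked numerically with a small sketch. The helper names `rotate` and `rope_score` and the alternative positions (12, 11) and (5, 1) are chosen here for illustration and are not part of the worked example.

```python
import math

def rotate(v, angle):
    c, s = math.cos(angle), math.sin(angle)
    return [c * v[0] - s * v[1], s * v[0] + c * v[1]]

def rope_score(q, k, m, n, theta=0.1):
    """Dot product of q rotated to position m and k rotated to position n."""
    qr, kr = rotate(q, m * theta), rotate(k, n * theta)
    return qr[0] * kr[0] + qr[1] * kr[1]

q2, k1 = [0.9, 0.7], [0.8, 0.6]

near = rope_score(q2, k1, 2, 1)       # distance 1, as in the text
shifted = rope_score(q2, k1, 12, 11)  # same distance, later in the sequence
farther = rope_score(q2, k1, 5, 1)    # distance 4

print(abs(near - shifted) < 1e-9)  # True: only the offset matters
print(abs(near - farther) > 1e-2)  # True: a different offset gives a different score
```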


Summary: Full Trace

| Operation | Input | Output | Key Insight |
| --- | --- | --- | --- |
| RMSNorm | [0.5, -1.2, 0.8, 0.3] | [0.643, -1.543, 1.029, 0.386] | Normalizes via RMS, simpler than LayerNorm |
| Attention | [0.643, …] | [0.6, -1.5, 1.0, 0.4] | Computes similarities (simplified) |
| Residual | [0.5, -1.2, 0.8, 0.3] + [0.6, -1.5, 1.0, 0.4] | [1.1, -2.7, 1.8, 0.7] | Preserves information via skip connection |
| RMSNorm (2nd) | [1.1, -2.7, 1.8, 0.7] | [0.629, -1.544, 1.030, 0.400] | Normalize again before FFN |
| SwiGLU | [0.629, …] | [0.404, 0.097] | Smooth activation + gating > ReLU |
| RoPE (alt) | Rotates by relative position | Relative-position encoding | Generalizes to longer sequences |

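Together, the three techniques cover the whole layer: RMSNorm stabilizes the input to each sublayer, RoPE injects position information inside attention, and SwiGLU provides the gated feedforward transformation.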