Normalisation (Batch Norm & Layer Norm)

1. What is this and why do we care?

When you train a neural network, each layer passes its output to the next. As these numbers flow through dozens of layers, they can drift wildly — exploding to millions, collapsing to near-zero, or skewing heavily to one side. This instability makes training slow, unpredictable, or broken entirely.

Normalisation is the solution: after each layer, rescale the activations so they have a consistent, well-behaved range. The network then trains faster and more stably.

The Transformer (Paper 08) uses Layer Normalisation after every attention block and every feed-forward block. Papers 10–12 (GPT, BERT) do the same. Without layer norm, training Transformers at scale would be nearly impossible.

Understanding normalisation is not optional if you want to read any paper from Paper 08 onwards.

2. Prerequisites

You need to know what mean and standard deviation are (even intuitively). If you recall that the mean is the average, and the standard deviation measures how spread out numbers are, you are ready.

3. The intuition — before any symbols

Suppose you run a small kirana shop and receive deliveries of goods each day. Some days you get 5 items. Some days, 500. Some days the prices are in rupees; some days in paise. If you try to compare today’s accounts with last week’s accounts, the numbers are all over the place — different scales, different ranges.

The solution: normalise your accounts. Express each day’s figure as its difference from the weekly average, measured in units of your typical daily variation. Now all weeks are comparable regardless of the absolute scale.

This is exactly what normalisation does to neural network activations. It asks: “Compared to the average activation in this layer, how many standard deviations away is this value?” The answer is always on a familiar, consistent scale.

4. The core formula: standardisation

The basic normalisation formula converts a value x into a standardised (zero-mean, unit-variance) value:

x̂ = (x − μ) / σ

Where:

μ (mu) = mean of the values being normalised
σ (sigma) = standard deviation of those values
x̂ = the normalised value (read “x-hat”)

What this achieves:

If x = μ (exactly average), then x̂ = 0
If x = μ + σ (one standard deviation above average), then x̂ = 1
If x = μ − 2σ (two standard deviations below), then x̂ = −2

The normalised values always have mean 0 and standard deviation 1, regardless of the original scale (as long as the values are not all identical, so that σ > 0).

Tiny worked example:

Suppose four neurons in a layer produce outputs: [3, 7, 5, 1].

Mean: μ = (3 + 7 + 5 + 1) / 4 = 16 / 4 = 4.0

Deviations from mean: [3−4, 7−4, 5−4, 1−4] = [−1, 3, 1, −3]

Variance: σ² = (1 + 9 + 1 + 9) / 4 = 20 / 4 = 5.0

Standard deviation: σ = √5.0 ≈ 2.236

Normalised values: x̂ = [−1/2.236, 3/2.236, 1/2.236, −3/2.236]
                      = [−0.447, 1.342, 0.447, −1.342]

Check mean: (−0.447 + 1.342 + 0.447 − 1.342) / 4 = 0 / 4 = 0.0 ✓
Check std:  std([−0.447, 1.342, 0.447, −1.342]) = 1.0 ✓
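
The worked example above can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the function name standardise is our own choice, and it uses the same "divide by n" variance as the hand calculation.

```python
import math

def standardise(values):
    """Return (x - mean) / std for each x, using the population std (divide by n)."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    std = math.sqrt(variance)
    return [(v - mean) / std for v in values]

x_hat = standardise([3, 7, 5, 1])
print([round(v, 3) for v in x_hat])  # [-0.447, 1.342, 0.447, -1.342]
```

Note that this bare version would fail on a list of identical values, because std would be 0.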

5. Learnable rescaling: γ and β

Raw normalisation forces every layer’s output to mean 0, std 1. But this might be too rigid — maybe the network needs a different mean or scale to solve the problem well.

The fix: after normalising, apply learnable scale (γ, gamma) and shift (β, beta) parameters:

y = γ · x̂ + β

Where γ and β are learned during training, just like weights. This lets the network “undo” the normalisation if that is what the task requires, while still getting the training stability benefits during gradient flow.

If γ = 1 and β = 0: pure normalisation (y = x̂, nothing is changed)
If γ = 2 and β = 0.5: the spread is doubled and every value is shifted up by 0.5

In practice, γ and β start at 1 and 0 and drift during training to whatever values minimise the loss.

Full Layer Norm formula:

LayerNorm(x) = γ · (x − μ) / √(σ² + ε) + β

The tiny constant ε (epsilon, typically 1e-5) is added to the variance under the square root, so the denominator can never be zero, even if σ = 0.
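
Here is a minimal sketch of the full formula for a single vector. It assumes ε is added to the variance inside the square root, which is how common libraries such as PyTorch place it; the function name layer_norm and the example values are illustrative only.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Layer-normalise one vector: gamma * (x - mean) / sqrt(var + eps) + beta."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    denom = math.sqrt(var + eps)
    return [g * (v - mean) / denom + b for v, g, b in zip(x, gamma, beta)]

# gamma = 1, beta = 0 leaves the standardised values as they are.
plain = layer_norm([3, 7, 5, 1], gamma=[1, 1, 1, 1], beta=[0, 0, 0, 0])

# gamma = 2, beta = 0.5 doubles the spread and shifts every value up by 0.5.
scaled = layer_norm([3, 7, 5, 1], gamma=[2, 2, 2, 2], beta=[0.5, 0.5, 0.5, 0.5])

# A constant vector has zero variance; eps keeps the division well defined.
flat = layer_norm([1.5, 1.5], gamma=[1, 1], beta=[0, 0])

print([round(v, 3) for v in plain])
print([round(v, 3) for v in scaled])
print(flat)  # [0.0, 0.0]
```

Because ε is so small, the first result matches the hand-computed values from Section 4 to three decimal places.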

6. Batch Norm vs Layer Norm — what is the difference?

These are the two most common normalisation strategies. The difference is what you normalise over.

Batch Normalisation (Batch Norm)

Proposed by Ioffe & Szegedy (2015). Normalise each feature across the entire batch of training examples.

For feature j:
μⱼ = mean of feature j across all B examples in the batch
σⱼ = std  of feature j across all B examples in the batch

Analogy: You are grading 30 students’ maths papers. For each question (feature), you normalise all 30 students’ scores for that question — zero mean, unit variance across students.

Problem: At inference time you often have a single example rather than a batch, so there are no batch statistics to compute. Batch Norm therefore keeps running estimates of each feature’s mean and variance during training and uses those frozen statistics at test time, which means the layer behaves differently in the two modes. It also works poorly for variable-length sequences, because sentences in a batch have different lengths.
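
To make the idea of frozen statistics concrete, here is a rough sketch for a single feature. It assumes the running estimates are kept as an exponential moving average, which is one common approach; the class name RunningFeatureStats and the momentum value of 0.1 are our own assumptions, not something specified in this tutorial.

```python
import math

class RunningFeatureStats:
    """Hypothetical helper: running mean/variance for ONE feature (column)."""

    def __init__(self, momentum=0.1):
        self.momentum = momentum  # assumed value; controls how fast estimates move
        self.mean = 0.0
        self.var = 1.0

    def update(self, column):
        """Blend this batch's statistics into the running estimates (training)."""
        n = len(column)
        batch_mean = sum(column) / n
        batch_var = sum((v - batch_mean) ** 2 for v in column) / n
        m = self.momentum
        self.mean = (1 - m) * self.mean + m * batch_mean
        self.var = (1 - m) * self.var + m * batch_var

    def normalise(self, value, eps=1e-5):
        """Normalise a single value with the frozen statistics (inference)."""
        return (value - self.mean) / math.sqrt(self.var + eps)

stats = RunningFeatureStats()
for _ in range(200):              # pretend every training batch looks the same
    stats.update([2.0, 4.0, 6.0])

z = stats.normalise(6.0)          # one test example, no batch needed
print(round(stats.mean, 3), round(stats.var, 3), round(z, 3))
```

The extra state (mean, var) and the separate training and inference paths are exactly the bookkeeping that Layer Norm avoids.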

Layer Normalisation (Layer Norm)

Proposed by Ba, Kiros & Hinton (2016). Normalise each example across all features in a single layer.

For one example x = [x₁, x₂, ..., xₙ]:
μ = mean of all n features for this single example
σ = std  of all n features for this single example

Analogy: You grade one student’s paper across all questions. For this one student, normalise their scores across all questions — not across all students.

Advantages for Transformers:

Works identically at training and inference time — no frozen statistics needed
Works on variable-length sequences naturally (each token normalised independently)
Parallelises perfectly — each position is normalised independently

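This is why major Transformers such as BERT and GPT use Layer Norm rather than Batch Norm. Some newer models, such as LLaMA, use RMSNorm, a simplified variant that also normalises within a single example.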
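
The two directions are easy to see on a tiny table of numbers. The sketch below is illustrative only: it leaves out γ and β, the helper name standardise is ours, and the batch values are made up. Rows are examples (students) and columns are features (questions).

```python
import math

def standardise(values, eps=1e-5):
    """Zero-mean, unit-variance version of a list (population variance)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return [(v - mean) / math.sqrt(var + eps) for v in values]

# Rows are examples, columns are features.
batch = [
    [1.0, 20.0, 300.0],
    [2.0, 40.0, 100.0],
    [3.0, 60.0, 200.0],
]

# Batch Norm direction: standardise each COLUMN across the examples.
columns = [list(col) for col in zip(*batch)]
batch_normed = [list(row) for row in zip(*[standardise(c) for c in columns])]

# Layer Norm direction: standardise each ROW across its own features.
layer_normed = [standardise(row) for row in batch]

for row in batch_normed:
    print([round(v, 3) for v in row])
for row in layer_normed:
    print([round(v, 3) for v in row])
```

In batch_normed every column averages to zero, and changing one row changes the result for all the others. In layer_normed every row averages to zero, and each row depends only on itself.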

7. Layer Norm in the Transformer — a full worked example

In the Transformer, a token’s representation is a vector of size d_model (512 in the original Transformer, 768 in BERT-base). After the attention sub-layer, Layer Norm is applied to this vector.

Suppose d_model = 4. One token’s representation after attention is:

x = [2.1, −0.5, 3.8, 0.6]

Step 1: Compute mean

μ = (2.1 + (−0.5) + 3.8 + 0.6) / 4 = 6.0 / 4 = 1.5

Step 2: Compute deviations from mean

x − μ = [2.1−1.5, −0.5−1.5, 3.8−1.5, 0.6−1.5]
       = [0.6, −2.0, 2.3, −0.9]

Step 3: Compute variance and standard deviation

σ² = (0.6² + (−2.0)² + 2.3² + (−0.9)²) / 4
   = (0.36 + 4.00 + 5.29 + 0.81) / 4
   = 10.46 / 4
   = 2.615

σ = √2.615 ≈ 1.617

Step 4: Normalise

x̂ = [0.6/1.617, −2.0/1.617, 2.3/1.617, −0.9/1.617]
   = [0.371, −1.237, 1.422, −0.557]

Step 5: Apply learnable γ and β

Suppose (after training) γ = [1.2, 0.8, 1.5, 1.0] and β = [0.1, 0.0, −0.2, 0.0]:

y₁ = 1.2 × 0.371 + 0.1 = 0.445 + 0.1 = 0.545
y₂ = 0.8 × (−1.237) + 0.0 = −0.990
y₃ = 1.5 × 1.422 + (−0.2) = 2.133 − 0.2 = 1.933
y₄ = 1.0 × (−0.557) + 0.0 = −0.557

y = [0.545, −0.990, 1.933, −0.557]

This vector y is the normalised token representation, passed to the next sub-layer. The extreme value (3.8) has been brought into a reasonable range; the negative (−0.5) is still negative but not dominant.
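
The five steps can be replayed in Python as a check. This sketch follows the hand calculation exactly, so it leaves out ε (which would not change the result at three decimal places here); the variable names are our own.

```python
import math

x = [2.1, -0.5, 3.8, 0.6]
gamma = [1.2, 0.8, 1.5, 1.0]
beta = [0.1, 0.0, -0.2, 0.0]

mu = sum(x) / len(x)                                     # Step 1: mean
dev = [v - mu for v in x]                                # Step 2: deviations
var = sum(d * d for d in dev) / len(x)                   # Step 3: variance
sigma = math.sqrt(var)                                   #         and std
x_hat = [d / sigma for d in dev]                         # Step 4: normalise
y = [g * h + b for g, h, b in zip(gamma, x_hat, beta)]   # Step 5: scale, shift

print(round(mu, 3), round(var, 3), round(sigma, 3))
print([round(v, 3) for v in x_hat])
print([round(v, 3) for v in y])
```

The second component of y prints as −0.989 rather than −0.990. The hand calculation rounds x̂ to three decimals before multiplying by γ, while the code keeps full precision throughout, so the last digit can differ.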

8. The “Add & Norm” block in Transformers

In the Transformer architecture, normalisation is always paired with a residual connection (the “Add” part):

output = LayerNorm(x + SubLayer(x))

Where SubLayer is either the attention block or the feed-forward block.

The residual connection adds the original input x back to the sub-layer’s output before normalising. This is borrowed from ResNets (2016). The benefit: even if the sub-layer learns nothing useful (produces output ≈ 0), the original x flows through unchanged. Gradients can propagate backward through the addition directly, keeping the network trainable even when very deep.

The sequence is: compute attention output → add original input (residual) → layer-normalise.

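Why this order? The original Transformer paper uses Post-LN: the norm comes after the residual is added, as in the formula above. Many later models (GPT-2 onward) use Pre-LN instead: the norm is applied to the sub-layer’s input, inside the residual branch, so output = x + SubLayer(LayerNorm(x)). Pre-LN tends to train more stably in deep stacks. Both are common.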
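
The difference between the two orderings is only where the norm sits relative to the addition. The sketch below is a toy illustration under stated assumptions: toy_sublayer is a made-up stand-in for attention or feed-forward, and layer_norm uses γ = 1 and β = 0 to keep things short.

```python
import math

def layer_norm(x, eps=1e-5):
    """Layer norm with gamma = 1 and beta = 0, to keep the sketch short."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def toy_sublayer(x):
    """Hypothetical stand-in for attention or feed-forward: halves each value."""
    return [0.5 * v for v in x]

def post_ln(x, sublayer):
    # Post-LN: add the residual first, then normalise the sum.
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_ln(x, sublayer):
    # Pre-LN: normalise the sub-layer's input, then add the untouched x.
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

x = [1.0, 2.0, 3.0, 6.0]
post = post_ln(x, toy_sublayer)
pre = pre_ln(x, toy_sublayer)
print([round(v, 3) for v in post])
print([round(v, 3) for v in pre])
```

In the Post-LN result the output itself is normalised, so its mean is zero. In the Pre-LN result the original x passes through the addition untouched, so the output keeps the mean of x.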

9. Common mistakes

Confusing normalisation with activation functions. Normalisation (like layer norm) reshapes the distribution of values. Activation functions (like ReLU or tanh) introduce non-linearity. They do different jobs and are used together, not interchangeably.
Forgetting ε in the denominator. If all features have the same value (σ = 0), dividing by σ would give infinity or NaN. The ε (1e-5 or 1e-8) prevents this. Easy to forget when implementing from scratch.
Confusing Batch Norm and Layer Norm’s axes. Batch Norm averages across examples (fixing per-feature statistics). Layer Norm averages within one example (fixing per-example statistics). These are orthogonal directions — getting them backwards is a common implementation error.

10. Try it yourself

Exercise 1: Apply Layer Norm to the vector x = [4, 0, 8, 4] using γ = [1,1,1,1] and β = [0,0,0,0].

Answer:

μ = (4 + 0 + 8 + 4) / 4 = 16 / 4 = 4.0

x − μ = [0, −4, 4, 0]

σ² = (0 + 16 + 16 + 0) / 4 = 32 / 4 = 8.0

σ = √8 ≈ 2.828

x̂ = [0/2.828, −4/2.828, 4/2.828, 0/2.828] = [0.0, −1.414, 1.414, 0.0]

With γ=1, β=0: y = x̂. Mean of y = 0.0 ✓, std of y = 1.0 ✓

Exercise 2: In a Transformer with d_model = 6, one token vector is [1, 5, 3, 7, 2, 6]. Compute the mean and standard deviation used for Layer Norm.

Answer:

μ = (1 + 5 + 3 + 7 + 2 + 6) / 6 = 24 / 6 = 4.0

Deviations: [−3, 1, −1, 3, −2, 2]

σ² = (9 + 1 + 1 + 9 + 4 + 4) / 6 = 28 / 6 ≈ 4.667

σ = √4.667 ≈ 2.160

Each element would be normalised as (xᵢ − 4.0) / 2.160.

Exercise 3: A residual connection is applied: x = [1.0, 2.0] and SubLayer(x) = [0.5, −0.5]. What is the input to Layer Norm?

Answer:

Input to LayerNorm = x + SubLayer(x) = [1.0 + 0.5, 2.0 + (−0.5)] = [1.5, 1.5]

Note: this vector has mean 1.5 and std 0, because every element is the same. Without ε, Layer Norm would compute 0 / 0 here. With ε in the denominator, x̂ = [0, 0], so the final output is simply β (γ multiplies zero). This illustrates why ε is essential.
