Section 04

The math: alignment scores, attention weights, and context vectors

Neural Machine Translation by Jointly Learning to Align and Translate 2014

🔴 Advanced undergrad. This section uses matrix multiplication and the softmax function. If you need a refresher, read Matrix Multiplication and Softmax Function before continuing.


The architecture

The Bahdanau attention model has three components:

  1. A bidirectional encoder producing one hidden state per source word
  2. An alignment model (additive attention) scoring how relevant each source state is to the current decoder state
  3. A decoder that uses a fresh, step-specific context vector at each generation step

Step 1: The bidirectional encoder

Let the source sentence have T words. A forward RNN reads left to right:

→h₁, →h₂, ..., →h_T

A backward RNN reads right to left:

←h_T, ←h_{T-1}, ..., ←h₁

These are concatenated into one hidden state per source word:

hᵢ = [→hᵢ ; ←hᵢ]

Where [ ; ] means concatenation. If each directional RNN state has dimension d, then each hᵢ has dimension 2d. (The paper uses gated recurrent units for both directions, not LSTMs.)

Why concatenate? Each hᵢ now contains information about word i from its full sentence context — words before AND after it. Compare to seq2seq’s encoder, where hᵢ only knew about words 1 through i.
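The concatenation step can be sketched in a few lines of plain Python. The state values and the helper name build_annotations are made up for illustration; the one detail worth noticing is that the backward reader emits its states in reverse word order, so they have to be flipped before pairing.

```python
# Hypothetical toy states for a T = 3 sentence with d = 2 per direction.
forward_states = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]   # →h₁, →h₂, →h₃
# The backward reader emits states in its own reading order: ←h₃ first, ←h₁ last.
backward_states = [[0.9, 0.8], [0.7, 0.6], [0.5, 0.4]]  # ←h₃, ←h₂, ←h₁


def build_annotations(forward, backward_in_reading_order):
    """Return hᵢ = [→hᵢ ; ←hᵢ] for every source position i."""
    # Reverse the backward states so index i refers to the same word in both lists.
    backward_aligned = backward_in_reading_order[::-1]
    # List concatenation plays the role of [ ; ].
    return [f + b for f, b in zip(forward, backward_aligned)]


annotations = build_annotations(forward_states, backward_states)
print(annotations[0])       # annotation for word 1
print(len(annotations[0]))  # 2d = 4
```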


Step 2: Alignment scores (additive attention)

At decoding step t, the decoder has a previous hidden state s_{t-1}. We want to score how compatible this decoder state is with each source hidden state hᵢ.

Bahdanau’s alignment model is:

eₜᵢ = vₐᵀ · tanh(Wₐ · s_{t-1} + Uₐ · hᵢ)

Where:

  • s_{t-1} — decoder’s previous hidden state (a vector of dimension n)
  • hᵢ — encoder hidden state for source word i (a vector of dimension 2d)
  • Wₐ — learnable weight matrix (n × n) applied to the decoder state
  • Uₐ — learnable weight matrix (n × 2d) applied to the encoder state
  • tanh — element-wise non-linearity, squashes values to (−1, +1)
  • vₐ — learnable weight vector that projects the combined representation to a scalar
  • eₜᵢ — the resulting scalar score: how relevant source word i is at decoding step t

In plain words: transform both the decoder state and the source state into the same space, add them, squash through tanh, then project to a single number. This single number is the alignment score.

This formulation is called additive attention (also called Bahdanau attention). It is different from dot-product attention (Luong / Transformer), where the score is simply s · hᵢ. We will see that simpler formula in Paper 08 — it is faster but loses the tanh non-linearity.
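As a minimal sketch, the additive score can be written directly from the formula. The function names matvec and additive_score, and all the numbers, are assumptions chosen for illustration (here n = 2 and 2d = 4); real implementations batch this over all source positions.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def additive_score(s_prev, h_i, W_a, U_a, v_a):
    """Return e = v_aᵀ · tanh(W_a · s_prev + U_a · h_i) as a single float."""
    decoder_term = matvec(W_a, s_prev)   # maps the decoder state into the shared space
    encoder_term = matvec(U_a, h_i)      # maps the encoder state into the same space
    hidden = [math.tanh(a + b) for a, b in zip(decoder_term, encoder_term)]
    return sum(v * z for v, z in zip(v_a, hidden))


# Hypothetical parameters: W_a is 2 x 2, U_a is 2 x 4, v_a has 2 entries.
s_prev = [0.5, -0.5]
h_i = [1.0, 0.0, -1.0, 0.5]
W_a = [[1.0, 0.0], [0.0, 1.0]]
U_a = [[0.5, 0.0, 0.0, 0.0], [0.0, 0.0, -0.5, 0.0]]
v_a = [1.0, 1.0]

score = additive_score(s_prev, h_i, W_a, U_a, v_a)
print(score)  # one scalar per (decoder state, source position) pair
```

Because tanh is bounded, the score can never exceed the sum of the absolute values of vₐ, whatever the inputs are.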


Step 3: Attention weights via softmax

Compute alignment scores for all T source positions at step t:

eₜ = [eₜ₁, eₜ₂, ..., eₜ_T]

Convert to attention weights using softmax:

αₜᵢ = exp(eₜᵢ) / Σⱼ exp(eₜⱼ)

The attention weights αₜ = [αₜ₁, αₜ₂, …, αₜ_T] are a proper probability distribution: each αₜᵢ ≥ 0 and Σᵢ αₜᵢ = 1.

Interpretation: αₜᵢ is the probability that target word t is aligned to source word i. If the model is generating the French word “économique” and the English source contains “economic” at position 4, then αₜ₄ should be large — close to 1 — while other positions have small weights.
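The softmax step is short enough to write out. This sketch subtracts the largest score before exponentiating, a common numerical-stability trick that is not part of the formula above but leaves the result unchanged; the example scores are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw alignment scores into attention weights that sum to 1."""
    peak = max(scores)                           # shift for numerical stability
    exps = [math.exp(e - peak) for e in scores]
    total = sum(exps)
    return [x / total for x in exps]


scores = [2.0, 1.0, 0.1]     # hypothetical eₜ₁, eₜ₂, eₜ₃
weights = softmax(scores)
print(weights)
print(sum(weights))
```

Without the shift, a score such as 1000.0 would overflow math.exp; with it, two equal scores of any size still give weights of 0.5 each.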


Step 4: Context vector as a weighted sum

The context vector for decoding step t is the attention-weighted sum of all encoder hidden states:

cₜ = Σᵢ αₜᵢ · hᵢ

This is a soft lookup: instead of retrieving one encoder state (hard lookup), we retrieve a blend of all of them, weighted by relevance. Words with high attention weight contribute most to cₜ.

Important: cₜ is different at every decoding step. Unlike seq2seq where the decoder saw the same context vector C at every step, Bahdanau’s decoder gets a fresh, query-specific context vector each time. This is the entire point of the mechanism.
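The difference between a hard and a soft lookup shows up clearly in code. In this sketch (hypothetical states and weights, helper name context_vector assumed), one-hot weights return a single encoder state unchanged, while softer weights return a blend.

```python
def context_vector(weights, encoder_states):
    """Return cₜ = Σᵢ αₜᵢ · hᵢ as a plain list."""
    dim = len(encoder_states[0])
    return [
        sum(a * h[k] for a, h in zip(weights, encoder_states))
        for k in range(dim)
    ]


# Hypothetical encoder states, chosen only for illustration.
H = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

hard = context_vector([0.0, 1.0, 0.0], H)   # one-hot weights pick out h₂
soft = context_vector([0.2, 0.7, 0.1], H)   # soft weights blend all three
print(hard)
print(soft)
```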


Step 5: Decoder update

The decoder hidden state at step t:

sₜ = f(sₜ₋₁, yₜ₋₁, cₜ)

Where:

  • sₜ₋₁ — previous decoder hidden state
  • yₜ₋₁ — previous output word (as a vector)
  • cₜ — context vector (freshly computed via attention)
  • f — a GRU or LSTM cell

The output probability over the vocabulary:

P(yₜ | y₁,...,yₜ₋₁, x) = softmax(Wo · g(sₜ, yₜ₋₁, cₜ))

Where g is a transformation (often a single linear layer) and Wo is the output projection matrix.
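To show how the three inputs flow through one decoder step, here is a deliberately simplified sketch. It is not the paper's cell: f is replaced by a single tanh layer and g by plain concatenation, and every name, size, and weight value below is an assumption made for illustration.

```python
import math


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decoder_step(s_prev, y_prev, c_t, W_f, W_o):
    # Stand-in for f: one tanh layer over [s_prev ; y_prev ; c_t].
    # A GRU or LSTM cell would add gates, but it consumes the same three inputs.
    s_t = [math.tanh(z) for z in matvec(W_f, s_prev + y_prev + c_t)]
    # Stand-in for g: plain concatenation, so W_o acts as one linear layer.
    logits = matvec(W_o, s_t + y_prev + c_t)
    return s_t, softmax(logits)


# Hypothetical sizes: state 2, word vector 2, context 2, vocabulary 3.
s_prev, y_prev, c_t = [0.3, 0.7], [1.0, 0.0], [0.5, 0.5]
W_f = [[0.1] * 6, [-0.1] * 6]
W_o = [[0.2] * 6, [0.0] * 6, [-0.2] * 6]

s_t, probs = decoder_step(s_prev, y_prev, c_t, W_f, W_o)
print(s_t)    # new decoder state
print(probs)  # distribution over the 3-word toy vocabulary
```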


Worked numerical example

Let us translate the Hindi phrase “Subah chai piyo” (Drink tea in the morning) with a tiny toy model.

Setup:

  • Source: 3 words → 3 encoder hidden states, each 2-dimensional (in the paper, each direction has 1000 hidden units, so every hᵢ is 2000-dimensional)
  • Decoder state: 2-dimensional
  • To keep numbers clean, we skip the full bidirectional encoder and work with direct encoder states

Encoder hidden states (given directly rather than produced by an encoder):
h₁ = [0.6, 0.2]   ("Subah" — morning)
h₂ = [0.4, 0.9]   ("chai"  — tea)
h₃ = [0.5, 0.3]   ("piyo"  — drink)

Decoder previous state:
s₀ = [0.3, 0.7]

Weight matrices (tiny, hand-picked):

Wₐ = [[0.5, 0.1],    Uₐ = [[0.4, 0.2],    vₐ = [1.0, 1.0]
      [0.2, 0.5]]          [0.1, 0.4]]

Compute Wₐ · s₀:

Wₐ · s₀ = [[0.5, 0.1], [0.2, 0.5]] · [0.3, 0.7]
         = [0.5×0.3 + 0.1×0.7,  0.2×0.3 + 0.5×0.7]
         = [0.15 + 0.07,  0.06 + 0.35]
         = [0.22, 0.41]

This term is the same for all source positions (it only depends on the decoder state).

For each source word, compute Uₐ · hᵢ:

For h₁ = [0.6, 0.2]:
Uₐ · h₁ = [0.4×0.6 + 0.2×0.2,  0.1×0.6 + 0.4×0.2]
         = [0.24 + 0.04,  0.06 + 0.08]
         = [0.28, 0.14]

For h₂ = [0.4, 0.9]:
Uₐ · h₂ = [0.4×0.4 + 0.2×0.9,  0.1×0.4 + 0.4×0.9]
         = [0.16 + 0.18,  0.04 + 0.36]
         = [0.34, 0.40]

For h₃ = [0.5, 0.3]:
Uₐ · h₃ = [0.4×0.5 + 0.2×0.3,  0.1×0.5 + 0.4×0.3]
         = [0.20 + 0.06,  0.05 + 0.12]
         = [0.26, 0.17]
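The projections above can be checked mechanically. This sketch reuses the toy matrices and states from the setup; only the helper name matvec is new. It also makes the reuse explicit: the decoder term is computed once, the encoder terms once per source word.

```python
def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


W_a = [[0.5, 0.1], [0.2, 0.5]]
U_a = [[0.4, 0.2], [0.1, 0.4]]
s_0 = [0.3, 0.7]
H = [[0.6, 0.2], [0.4, 0.9], [0.5, 0.3]]   # h₁, h₂, h₃

decoder_term = matvec(W_a, s_0)                # same for every source position
encoder_terms = [matvec(U_a, h) for h in H]    # one per source word

print([round(x, 2) for x in decoder_term])
for term in encoder_terms:
    print([round(x, 2) for x in term])
```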

Add decoder and encoder terms, apply tanh:

For word 1: [0.22+0.28, 0.41+0.14] = [0.50, 0.55]
  tanh([0.50, 0.55]) ≈ [0.462, 0.501]
  e₁ = vₐᵀ · [0.462, 0.501] = 1×0.462 + 1×0.501 = 0.963

For word 2: [0.22+0.34, 0.41+0.40] = [0.56, 0.81]
  tanh([0.56, 0.81]) ≈ [0.508, 0.670]
  e₂ = 0.508 + 0.670 = 1.178

For word 3: [0.22+0.26, 0.41+0.17] = [0.48, 0.58]
  tanh([0.48, 0.58]) ≈ [0.446, 0.523]
  e₃ = 0.446 + 0.523 = 0.969

Apply softmax to get attention weights:

exp(0.963) ≈ 2.620
exp(1.178) ≈ 3.248
exp(0.969) ≈ 2.635

Sum = 2.620 + 3.248 + 2.635 = 8.503

α₁ = 2.620 / 8.503 ≈ 0.308
α₂ = 3.248 / 8.503 ≈ 0.382   ← highest: model attends most to "chai"
α₃ = 2.635 / 8.503 ≈ 0.310

Check: 0.308 + 0.382 + 0.310 = 1.000 ✓
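The same scores and weights can be recomputed end to end in plain Python. The sketch keeps intermediate values at full floating-point precision, so the third decimal of a score can differ slightly from a hand-rounded figure; the weights still round to 0.308, 0.382, and 0.310. The helper names are assumptions.

```python
import math


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


W_a = [[0.5, 0.1], [0.2, 0.5]]
U_a = [[0.4, 0.2], [0.1, 0.4]]
v_a = [1.0, 1.0]
s_0 = [0.3, 0.7]
H = [[0.6, 0.2], [0.4, 0.9], [0.5, 0.3]]   # Subah, chai, piyo

decoder_term = matvec(W_a, s_0)


def score(h_i):
    """Additive alignment score for one source state."""
    summed = [a + b for a, b in zip(decoder_term, matvec(U_a, h_i))]
    return sum(v * math.tanh(z) for v, z in zip(v_a, summed))


scores = [score(h) for h in H]
exps = [math.exp(e) for e in scores]
weights = [x / sum(exps) for x in exps]

print([round(e, 3) for e in scores])
print([round(a, 3) for a in weights])
```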

Compute context vector:

c₁ = α₁·h₁ + α₂·h₂ + α₃·h₃

   = 0.308×[0.6, 0.2] + 0.382×[0.4, 0.9] + 0.310×[0.5, 0.3]

   = [0.185, 0.062] + [0.153, 0.344] + [0.155, 0.093]

   = [0.493, 0.499]
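A quick check of the weighted sum, using the rounded weights from the text. Summing before rounding gives roughly [0.4926, 0.4984], which agrees with the hand computation up to rounding of the intermediate terms.

```python
weights = [0.308, 0.382, 0.310]            # α₁, α₂, α₃ as rounded in the text
H = [[0.6, 0.2], [0.4, 0.9], [0.5, 0.3]]   # h₁, h₂, h₃

# c₁ = α₁·h₁ + α₂·h₂ + α₃·h₃, computed one component at a time.
c_1 = [sum(a * h[k] for a, h in zip(weights, H)) for k in range(2)]
print(c_1)
```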

The context vector c₁ = [0.493, 0.499] is a blend of all three source states, with “chai” (word 2) receiving the most weight (38.2%). The decoder uses this vector to generate the first English word. For the target “Drink tea in the morning” that word is “Drink”, so a trained model would be expected to put most of its weight on “piyo” at this step; our hand-picked weights are untrained and only illustrate the mechanics.

Likewise, when a trained model has to generate “tea”, α₂ should ideally be close to 1.0. The model learns these alignments during training.


Additive vs dot-product attention

For comparison: in Paper 08 (Transformer), the alignment score is simply:

eᵢ = qᵀ · kᵢ    (dot product of query and key vectors)

This is faster (no tanh, no projection vector) and scales well with matrix operations. Bahdanau’s additive formulation has a tanh non-linearity that adds expressiveness but is slower.

In 2015, Luong et al. reported that multiplicative scoring functions, including the plain dot product, can match or beat additive scoring in their experiments while needing less computation. Paper 08 adopts dot-product attention divided by √dₖ, where dₖ is the dimension of the key vectors (this keeps large dot products from saturating the softmax). But the core idea (score, softmax, weighted sum) is exactly Bahdanau’s.
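A small sketch of why the scaling matters. The vectors below are hypothetical, with dimension 64; dividing by the square root of the key dimension shrinks the scores so the softmax is less extreme. The function names are assumptions, not an API from any library.

```python
import math


def dot_score(q, k):
    """Unscaled dot-product score qᵀ · k."""
    return sum(a * b for a, b in zip(q, k))


def scaled_dot_score(q, k):
    """Dot product divided by the square root of the key dimension."""
    return dot_score(q, k) / math.sqrt(len(k))


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(e - peak) for e in scores]
    total = sum(exps)
    return [x / total for x in exps]


# Hypothetical query and keys of dimension 64.
q = [1.0] * 64
keys = [[1.0] * 64, [0.5] * 64, [0.0] * 64]

raw = [dot_score(q, k) for k in keys]
scaled = [scaled_dot_score(q, k) for k in keys]

print(raw, softmax(raw))        # nearly all weight on the first key
print(scaled, softmax(scaled))  # still peaked, but less saturated
```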