Mutual Information: How Much Do Two Variables Share?
The Core Idea
Mutual information measures how much knowing one variable tells you about another. It answers the question: if I know X, how much uncertainty about Y disappears?
Think of it like this. You and your friend both follow IPL cricket. You message each other every morning.
- If you already know your team won yesterday (from the news), and your friend says “we won!” — that message carries zero new information. You already knew.
- If you had no idea about the result and your friend says “we won!” — that message carries a lot of information. Your uncertainty about Y (result) collapsed completely once you knew X (friend’s message).
Mutual information quantifies exactly this collapse in uncertainty.
Formal definition:
$$I(X; Y) = H(X) - H(X \mid Y)$$
Where:
- $H(X)$ is the entropy of $X$ — your uncertainty about $X$ before knowing $Y$
- $H(X \mid Y)$ is the conditional entropy — your remaining uncertainty about $X$ after knowing $Y$
- $I(X; Y)$ is the reduction in uncertainty — information shared between $X$ and $Y$
Equivalently:
$$I(X; Y) = H(Y) - H(Y \mid X) = H(X) + H(Y) - H(X, Y)$$
All three forms give the same number. The first is most intuitive: mutual information is how much entropy disappears when you learn the other variable.
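As a quick numerical check, the sketch below evaluates all three forms on a small joint table. The table values and the helper name `entropy` are illustrative assumptions; the conditional entropies are computed as probability-weighted averages of per-value entropies.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# Hypothetical joint distribution P(X, Y): rows index X, columns index Y.
joint = np.array([
    [0.30, 0.10],
    [0.20, 0.40],
])

px = joint.sum(axis=1)  # marginal P(X)
py = joint.sum(axis=0)  # marginal P(Y)

h_x, h_y, h_xy = entropy(px), entropy(py), entropy(joint)

# Conditional entropies: average the entropy of each conditional distribution.
h_x_given_y = sum(py[j] * entropy(joint[:, j] / py[j]) for j in range(len(py)))
h_y_given_x = sum(px[i] * entropy(joint[i, :] / px[i]) for i in range(len(px)))

form_1 = h_x - h_x_given_y   # H(X) - H(X|Y)
form_2 = h_y - h_y_given_x   # H(Y) - H(Y|X)
form_3 = h_x + h_y - h_xy    # H(X) + H(Y) - H(X,Y)

print(f"{form_1:.6f}  {form_2:.6f}  {form_3:.6f}")  # the three values agree
```

Any valid joint table (non-negative entries summing to 1) can be substituted for `joint`; the three values should still match up to floating-point error.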
Conditional Entropy: The Missing Piece
Before computing mutual information, you need conditional entropy $H(X \mid Y)$.
$$H(X \mid Y) = \sum_{y} P(y) \cdot H(X \mid Y = y)$$
This is: for each possible value of $Y$, compute the entropy of $X$ given that value, then average across all $Y$ values weighted by their probability.
If $Y$ completely determines $X$, then $H(X \mid Y) = 0$ — no uncertainty remains. If $Y$ tells you nothing about $X$, then $H(X \mid Y) = H(X)$ — all the original uncertainty remains.
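The two limiting cases can be checked numerically. This is a minimal sketch: the function name `conditional_entropy` and both example tables are assumptions chosen for illustration.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits, skipping zero-probability outcomes."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def conditional_entropy(joint):
    """H(X | Y) for a joint table with rows = X values and columns = Y values."""
    joint = np.asarray(joint, dtype=float)
    py = joint.sum(axis=0)
    total = 0.0
    for j, p_y in enumerate(py):
        if p_y > 0:  # values of Y that never occur contribute nothing
            total += p_y * entropy(joint[:, j] / p_y)
    return total

# Case 1: Y determines X (all mass on the diagonal), so no uncertainty remains.
deterministic = np.array([[0.5, 0.0],
                          [0.0, 0.5]])

# Case 2: X and Y independent, so the joint is the outer product of the marginals.
px = np.array([0.3, 0.7])
py = np.array([0.6, 0.4])
independent = np.outer(px, py)

h_det = conditional_entropy(deterministic)
h_ind = conditional_entropy(independent)

print(f"Y determines X: H(X|Y) = {h_det:.3f} bits")
print(f"Independent:    H(X|Y) = {h_ind:.3f} bits, H(X) = {entropy(px):.3f} bits")
```

In the first case the result is zero; in the second it equals $H(X)$, because every conditional distribution of $X$ is just the marginal.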
Worked Example: Weather and Umbrella
Suppose you’re deciding whether to carry an umbrella. Let:
- $X$ = Weather tomorrow: Rain (probability 0.4) or Sun (probability 0.6)
- $Y$ = Weather forecast: Rain forecast or Sun forecast
The forecast is not perfect. Suppose the joint distribution of actual weather and forecast is:
| | Forecast: Rain | Forecast: Sun |
|---|---|---|
| Actual: Rain | 0.36 | 0.04 |
| Actual: Sun | 0.12 | 0.48 |
Step 1 — Compute H(X), entropy of actual weather:
$P(\text{Rain}) = 0.36 + 0.04 = 0.40$, $P(\text{Sun}) = 0.60$
$$\begin{aligned} H(X) &= -(0.4 \log_2 0.4 + 0.6 \log_2 0.6) \\ &= -(0.4 \times (-1.322) + 0.6 \times (-0.737)) \\ &= -(-0.529 - 0.442) = 0.971 \text{ bits} \end{aligned}$$
Step 2 — Compute H(X | Y), conditional entropy:
First find the marginals of Y:
- $P(\text{Forecast Rain}) = 0.36 + 0.12 = 0.48$
- $P(\text{Forecast Sun}) = 0.04 + 0.48 = 0.52$
Given Forecast = Rain: $P(\text{Rain} \mid \text{F=Rain}) = 0.36/0.48 = 0.75$, $P(\text{Sun} \mid \text{F=Rain}) = 0.25$
$$H(X \mid Y=\text{Rain}) = -(0.75 \log_2 0.75 + 0.25 \log_2 0.25) = -(0.75 \times (-0.415) + 0.25 \times (-2)) = 0.811 \text{ bits}$$
Given Forecast = Sun: $P(\text{Rain} \mid \text{F=Sun}) = 0.04/0.52 \approx 0.077$, $P(\text{Sun} \mid \text{F=Sun}) \approx 0.923$
$$H(X \mid Y=\text{Sun}) = -(0.077 \log_2 0.077 + 0.923 \log_2 0.923) \approx -(0.077 \times (-3.70) + 0.923 \times (-0.115)) \approx 0.391 \text{ bits}$$
Now average:
$$H(X \mid Y) = 0.48 \times 0.811 + 0.52 \times 0.391 = 0.3893 + 0.2033 \approx 0.593 \text{ bits}$$
Step 3 — Compute I(X; Y):
$$I(X; Y) = H(X) - H(X \mid Y) = 0.971 - 0.593 = 0.378 \text{ bits}$$
Interpretation: Knowing the forecast reduces your uncertainty about the actual weather by 0.378 bits out of a maximum possible 0.971 bits. The forecast is helpful but not perfect: about 39% of your uncertainty is resolved by it.
Key Properties
1. Symmetric: $I(X; Y) = I(Y; X)$ — the amount X tells you about Y equals the amount Y tells you about X.
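2. Non-negative: $I(X; Y) \geq 0$ always. On average, knowing $Y$ can only reduce or maintain your uncertainty about $X$, never increase it (a single observed value of $Y$ can still leave you more uncertain than before).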
3. Zero when independent: If $X$ and $Y$ are statistically independent, $I(X; Y) = 0$. Knowing one tells you nothing about the other.
4. Bounded: $I(X; Y) \leq \min(H(X), H(Y))$. Mutual information can’t exceed the entropy of either variable.
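The four properties can be spot-checked on concrete tables. The sketch below uses the equivalent double-sum form $I(X;Y) = \sum_{x,y} P(x,y) \log_2 \frac{P(x,y)}{P(x)P(y)}$, a standard identity that is not derived in this tutorial. The function name `mutual_information` and the example tables are assumptions for illustration, and a numerical check on two tables is not a proof.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits, skipping zero-probability outcomes."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def mutual_information(joint):
    """I(X; Y) in bits from a joint table (rows = X values, columns = Y values)."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)  # column vector of P(x)
    py = joint.sum(axis=0, keepdims=True)  # row vector of P(y)
    mask = joint > 0                       # zero cells contribute nothing
    ratio = joint[mask] / (px @ py)[mask]  # P(x,y) / (P(x) P(y))
    return float(np.sum(joint[mask] * np.log2(ratio)))

# Hypothetical dependent joint distribution.
joint = np.array([[0.30, 0.10],
                  [0.20, 0.40]])

# Independent joint: outer product of two marginals.
independent = np.outer([0.3, 0.7], [0.6, 0.4])

mi = mutual_information(joint)
mi_swapped = mutual_information(joint.T)       # roles of X and Y exchanged
mi_independent = mutual_information(independent)
bound = min(entropy(joint.sum(axis=1)), entropy(joint.sum(axis=0)))

print(f"Symmetric:    I(X;Y) = {mi:.6f}, I(Y;X) = {mi_swapped:.6f}")
print(f"Non-negative: {mi >= 0}")
print(f"Independent:  I(X;Y) = {abs(mi_independent):.6f}")
print(f"Bounded:      {mi:.6f} <= {bound:.6f}")
```

Transposing the table swaps the roles of $X$ and $Y$, which is why `joint.T` is enough to test symmetry.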
Connection to Attention Mechanisms
This is where mutual information becomes directly relevant to Paper 08 (Transformer) and Paper 07 (Bahdanau Attention).
In an attention mechanism, at each decoding step, the model computes a query $q$ (what am I looking for?) and a set of keys $k_1, k_2, \ldots, k_n$ (what information is available from the input?).
The attention weights $\alpha_i$ tell the model: how much should I focus on input position $i$ when generating this output token?
The mutual information interpretation: as a heuristic rather than a formal identity, attention weights can be read as tracking $I(q; k_i)$, the mutual information between the query and each key. A high attention weight at position $i$ means that knowing the content at input position $i$ substantially reduces uncertainty about what the correct output should be.
More precisely, if the query represents “what word am I trying to generate?” and key $k_i$ represents “what word appeared at input position $i$?”, then positions with high mutual information with the query get high attention weights.
In the Transformer, this connection is made explicit through scaled dot-product attention:
$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$
The dot product $QK^T$ is a proxy for how aligned the query and key are, which can be read as approximating mutual information only under strong distributional assumptions. The $\sqrt{d_k}$ scaling stops the dot products from growing with the dimension $d_k$; without it, large scores push the softmax toward a near one-hot, low-entropy distribution with very small gradients.
The appealing interpretation: attention is a differentiable, learnable way to concentrate on the keys that are most informative for each query. Mutual information is not optimised explicitly; the model learns which positions to attend to because attending to informative positions lowers the training loss.
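To make the role of the $\sqrt{d_k}$ scaling concrete, here is a minimal NumPy sketch of scaled dot-product attention that compares the entropy of the attention weights with and without scaling. Everything here is an assumption for illustration: the random Gaussian `Q`, `K`, `V`, the sizes, and the helper names `softmax`, `row_entropy` and `attention`. It illustrates only the scaling effect, not the mutual information reading, and it is not a trained model.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def row_entropy(p):
    """Shannon entropy (bits) of each row of p."""
    safe = np.where(p > 0, p, 1.0)  # log2(1) = 0, so zero entries add nothing
    return -np.sum(p * np.log2(safe), axis=-1)

def attention(Q, K, V, scale=True):
    """Dot-product attention; returns the output and the attention weights."""
    d_k = K.shape[-1]
    scores = Q @ K.T
    if scale:
        scores = scores / np.sqrt(d_k)
    weights = softmax(scores)
    return weights @ V, weights

rng = np.random.default_rng(0)
n_queries, n_keys, d_k = 100, 10, 512  # illustrative sizes only
Q = rng.standard_normal((n_queries, d_k))
K = rng.standard_normal((n_keys, d_k))
V = rng.standard_normal((n_keys, d_k))

_, w_scaled = attention(Q, K, V, scale=True)
_, w_unscaled = attention(Q, K, V, scale=False)

mean_h_scaled = row_entropy(w_scaled).mean()
mean_h_unscaled = row_entropy(w_unscaled).mean()

# Upper limit for 10 keys is log2(10), about 3.32 bits (uniform attention).
print(f"Mean entropy with scaling:    {mean_h_scaled:.3f} bits")
print(f"Mean entropy without scaling: {mean_h_unscaled:.3f} bits")
```

With unit-variance random vectors, the unscaled scores have a standard deviation of roughly $\sqrt{d_k}$, so the unscaled softmax tends to put almost all its weight on one key, while the scaled version stays closer to a spread-out distribution.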
Python: Computing Mutual Information
```python
import numpy as np

# Joint probability table: rows = X (Rain/Sun), cols = Y (Forecast Rain/Sun)
joint = np.array([
    [0.36, 0.04],  # X=Rain: P(Rain,FRain), P(Rain,FSun)
    [0.12, 0.48],  # X=Sun:  P(Sun,FRain),  P(Sun,FSun)
])

def entropy(p):
    """Shannon entropy (in bits) of a probability distribution."""
    p = p[p > 0]  # Ignore zero-probability outcomes
    return -np.sum(p * np.log2(p))

# Marginals
px = joint.sum(axis=1)  # [0.40, 0.60]
py = joint.sum(axis=0)  # [0.48, 0.52]

# H(X): entropy of actual weather
hx = entropy(px)  # 0.971 bits

# H(X|Y): conditional entropy, averaged over forecast values
hx_given_y = 0.0
for j in range(joint.shape[1]):
    # Conditional distribution of X given Y=j
    p_x_given_yj = joint[:, j] / py[j]
    hx_given_y += py[j] * entropy(p_x_given_yj)

# Mutual information
mi = hx - hx_given_y

print(f"H(X)   = {hx:.3f} bits")
print(f"H(X|Y) = {hx_given_y:.3f} bits")
print(f"I(X;Y) = {mi:.3f} bits")  # → 0.378 bits
```
Summary
| Concept | Formula | Meaning |
|---|---|---|
| Entropy | $H(X) = -\sum P(x) \log_2 P(x)$ | Total uncertainty in $X$ |
| Conditional Entropy | $H(X \mid Y) = \sum_y P(y) H(X \mid Y=y)$ | Remaining uncertainty after knowing $Y$ |
| Mutual Information | $I(X;Y) = H(X) - H(X \mid Y)$ | Uncertainty in $X$ resolved by $Y$ |
| Independent variables | $I(X;Y) = 0$ | Knowing one tells you nothing |
| Perfect predictor | $I(X;Y) = H(X)$ | $Y$ fully determines $X$ |
Key Takeaways
- Mutual information measures shared information. It’s the reduction in uncertainty about $X$ when you learn $Y$.
- It’s symmetric: $I(X;Y) = I(Y;X)$. The weather tells you as much about the forecast as the forecast tells you about the weather.
- It’s always non-negative. On average, learning $Y$ can never increase your uncertainty about $X$.
- Attention mechanisms implement a learnable proxy for mutual information. High attention weights ≈ high mutual information between query and key.
- Prerequisites: Make sure you’re comfortable with Entropy and Conditional Probability before this tutorial.
Used in:
- Paper 07: Neural Machine Translation by Jointly Learning to Align and Translate — attention weights as relevance scores
- Paper 08: Attention Is All You Need — scaled dot-product attention as mutual information proxy