Entropy: Measuring Uncertainty
What is Entropy?
Entropy is a measure of uncertainty, randomness, or lack of information in a probability distribution. The more uncertain you are about the outcome of an event, the higher the entropy.
Think of it this way: if you flip a fair coin (50% heads, 50% tails), the outcome is very uncertain—entropy is high. But if you flip a biased coin that always lands heads, the outcome is certain—entropy is zero because you know exactly what will happen.
Formal definition:
$$H(X) = -\sum_{x} P(x) \log_2 P(x)$$
Where:
- $H(X)$ is the entropy of random variable $X$
- $P(x)$ is the probability of outcome $x$
- $\log_2$ is the base-2 logarithm, so entropy is measured in “bits”
- The sum runs over all possible outcomes; outcomes with $P(x) = 0$ contribute nothing, by the convention $0 \log_2 0 = 0$
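The definition translates almost directly into code. The following is a minimal sketch using only the Python standard library; the function name `entropy` and its `base` parameter are illustrative choices, not taken from any particular library. It writes each term as $P(x) \log(1/P(x))$, which is the same quantity as $-P(x) \log P(x)$, and skips zero-probability outcomes.

```python
import math

def entropy(probs, base=2.0):
    """Shannon entropy of a discrete distribution given as a sequence of probabilities."""
    if any(p < 0 for p in probs):
        raise ValueError("probabilities must be non-negative")
    if not math.isclose(sum(probs), 1.0, abs_tol=1e-9):
        raise ValueError("probabilities must sum to 1")
    # p * log(1/p) equals -p * log(p); outcomes with p == 0 are skipped
    # because they contribute nothing to the sum.
    return sum(p * math.log(1.0 / p, base) for p in probs if p > 0)

print(entropy([0.5, 0.5]))  # fair coin: maximally uncertain for two outcomes
print(entropy([1.0, 0.0]))  # coin that always lands heads: no uncertainty
```

Passing `base=math.e` would give the result in nats instead of bits.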
Intuition: Information Content
When you see something unexpected, it gives you more information than seeing something expected.
Example: You’re watching a weather forecast.
- Announcer says: “It will rain tomorrow,” during monsoon season, when it rains 99% of the time.
  → You expected rain, so this tells you almost nothing new. Low information.
- Announcer says: “It will rain tomorrow,” in a season when it rains only 1% of the time and is usually sunny.
  → This is surprising! It tells you a lot. High information.
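Entropy quantifies this. An outcome with probability $P(x)$ carries $-\log_2 P(x)$ bits of information, so low-probability outcomes are more informative than high-probability ones. Entropy is the average of this quantity over all outcomes, each weighted by how likely it is.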
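To put numbers on the weather example, the information carried by a single outcome with probability $p$ can be computed as $\log_2(1/p)$. This quantity is commonly called surprisal or self-information, a name that does not appear in the text above. The sketch below is illustrative and the function name `surprisal_bits` is an assumption.

```python
import math

def surprisal_bits(p):
    """Information content, in bits, of an outcome that has probability p."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return math.log2(1.0 / p)

# Expected rain (99%), a coin-flip forecast (50%), and surprising rain (1%).
for p in (0.99, 0.5, 0.01):
    print(f"P = {p:<4}  information = {surprisal_bits(p):.4f} bits")
```

The 99% event carries only about 0.0145 bits, while the 1% event carries about 6.64 bits.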
Worked Example: Fair Coin Flip
A fair coin has two outcomes: heads or tails, each with probability 0.5.
Calculate entropy:
$$H(X) = -[P(\text{heads}) \log_2 P(\text{heads}) + P(\text{tails}) \log_2 P(\text{tails})]$$
$$H(X) = -[0.5 \log_2 0.5 + 0.5 \log_2 0.5]$$
Now, $\log_2 0.5 = \log_2 (1/2) = -1$
$$H(X) = -[0.5 \times (-1) + 0.5 \times (-1)]$$
$$H(X) = -[-0.5 - 0.5] = -[-1] = 1 \text{ bit}$$
Interpretation: A fair coin flip has entropy of 1 bit. This makes sense: you need 1 bit of information to specify the outcome (0 for heads, 1 for tails).
Worked Example: Biased Coin (99% Heads)
A biased coin: P(heads) = 0.99, P(tails) = 0.01
Calculate entropy:
$$H(X) = -[0.99 \log_2 0.99 + 0.01 \log_2 0.01]$$
Calculate each term:
- $\log_2 0.99 \approx -0.0145$
- $\log_2 0.01 \approx -6.644$
$$H(X) = -[0.99 \times (-0.0145) + 0.01 \times (-6.644)]$$
$$H(X) = -[-0.01436 - 0.06644]$$
$$H(X) = -[-0.0808] = 0.0808 \text{ bits}$$
Interpretation: A biased coin has entropy of ~0.08 bits. Much lower than 1 bit because we’re very confident it will be heads. The outcome is less surprising.
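The two coin examples are points on the same curve: the entropy of a coin as a function of its bias. The sketch below evaluates that curve at a few biases, assuming a helper named `binary_entropy` (an illustrative name, not from the text).

```python
import math

def binary_entropy(p):
    """Entropy in bits of a coin that lands heads with probability p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    # Sum over the two outcomes (heads, tails), skipping any with zero probability.
    return sum(q * math.log2(1.0 / q) for q in (p, 1.0 - p) if q > 0)

for p in (0.5, 0.7, 0.9, 0.99, 1.0):
    print(f"P(heads) = {p:.2f}  entropy = {binary_entropy(p):.4f} bits")
```

Entropy falls steadily from 1 bit at the fair coin toward 0 as the coin becomes more biased.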
Worked Example: A Six-Sided Die
A fair die with 6 outcomes, each with probability 1/6.
Calculate entropy:
$$H(X) = -\sum_{i=1}^{6} \frac{1}{6} \log_2 \frac{1}{6}$$
$$H(X) = -6 \times \frac{1}{6} \log_2 \frac{1}{6}$$
$$H(X) = -\log_2 \frac{1}{6} = -\log_2(6^{-1}) = \log_2 6$$
$$\log_2 6 = \log_2(2 \times 3) = 1 + \log_2 3 \approx 1 + 1.585 = 2.585 \text{ bits}$$
Interpretation: A fair die has entropy of ~2.585 bits. Higher than a fair coin (1 bit) because there are more equally likely outcomes.
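The same value can be approached from data rather than from known probabilities. The sketch below simulates die rolls and estimates entropy from the observed frequencies (a simple "plug-in" estimate, which is a technique swapped in here for illustration and not part of the derivation above). The function name `empirical_entropy`, the seed, and the sample size are arbitrary assumptions.

```python
import math
import random
from collections import Counter

def empirical_entropy(samples):
    """Plug-in entropy estimate in bits, using observed frequencies as probabilities."""
    counts = Counter(samples)
    total = len(samples)
    # Each observed value has estimated probability c / total.
    return sum((c / total) * math.log2(total / c) for c in counts.values())

rng = random.Random(0)  # fixed seed so the run is repeatable
rolls = [rng.randint(1, 6) for _ in range(60_000)]

print(f"estimated from rolls: {empirical_entropy(rolls):.4f} bits")
print(f"exact log2(6):        {math.log2(6):.4f} bits")
```

With this many rolls the estimate should land very close to $\log_2 6$; with few rolls it tends to underestimate the true entropy.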
Key Properties
1. Entropy is Always Non-Negative
$$H(X) \geq 0$$
Entropy equals 0 when the distribution is certain (one outcome has probability 1, all others 0).
Entropy is maximum when all outcomes are equally likely.
2. Maximum Entropy
For a distribution with $n$ possible outcomes, maximum entropy is:
$$H_{\max} = \log_2 n$$
Example: A die with 6 faces has maximum entropy of $\log_2 6 \approx 2.585$ bits.
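A quick numerical check of this property: generate many random distributions over 6 outcomes and confirm that none exceeds $\log_2 6$. This is a sanity check under arbitrary assumptions (seed, number of trials, helper names), not a proof.

```python
import math
import random

def entropy_bits(probs):
    """Shannon entropy in bits, skipping zero-probability outcomes."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

def random_distribution(n, rng):
    """Draw n random positive weights and normalize them to sum to 1."""
    weights = [rng.random() + 1e-12 for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]

n = 6
rng = random.Random(42)
h_max = math.log2(n)
largest = max(entropy_bits(random_distribution(n, rng)) for _ in range(10_000))

print(f"uniform distribution:      {entropy_bits([1 / n] * n):.4f} bits")
print(f"largest random entropy:    {largest:.4f} bits")
print(f"theoretical maximum log2n: {h_max:.4f} bits")
```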
3. Entropy is Measured in Bits (if base-2 log)
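If you use the natural log (base $e$), entropy is measured in “nats”. If you use base 10, it is measured in “dits” (also called hartleys). Information theory typically uses base 2 for the intuitive “bits” interpretation, while machine learning code often computes losses with the natural log. Changing the base only rescales the value by a constant factor.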
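Because changing the logarithm base only multiplies the result by a constant, converting between units is a single multiplication: $H_{\text{nats}} = H_{\text{bits}} \cdot \ln 2$. A small sketch, with an illustrative `entropy` helper that takes the base as a parameter:

```python
import math

def entropy(probs, base):
    """Shannon entropy using the given logarithm base."""
    return sum(p * math.log(1.0 / p, base) for p in probs if p > 0)

die = [1 / 6] * 6
bits = entropy(die, 2)
nats = entropy(die, math.e)
dits = entropy(die, 10)

print(f"{bits:.4f} bits = {nats:.4f} nats = {dits:.4f} dits")

# Converting from bits is a multiplication by a constant.
print(math.isclose(nats, bits * math.log(2)))
print(math.isclose(dits, bits * math.log10(2)))
```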
Real-World Example: Language Models
In language models, entropy tells you how “confident” the model is about the next token.
Scenario 1: Predicting the next word in “The capital of France is ___”
The model might predict:
- P(Paris) = 0.95
- P(Lyon) = 0.02
- P(London) = 0.01
- P(other) = 0.02
$$H \approx -[0.95 \log_2 0.95 + 0.02 \log_2 0.02 + 0.01 \log_2 0.01 + 0.02 \log_2 0.02]$$
$$H \approx -[-0.070 - 0.113 - 0.066 - 0.113] = -[-0.362] \approx 0.362 \text{ bits}$$
Low entropy because the model is very confident in “Paris”.
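The arithmetic for Scenario 1 can be checked term by term. The sketch below uses the probabilities listed above and, like the calculation in the text, treats “other” as a single outcome, which is a simplification. The variable names are illustrative.

```python
import math

# Next-token probabilities from Scenario 1; "other" is lumped into one outcome.
next_token_probs = {"Paris": 0.95, "Lyon": 0.02, "London": 0.01, "other": 0.02}

# Each token's contribution to the entropy: p * log2(1/p).
contributions = {tok: p * math.log2(1.0 / p) for tok, p in next_token_probs.items()}
entropy_bits = sum(contributions.values())

for tok, c in contributions.items():
    print(f"{tok:<7} contributes {c:.4f} bits")
print(f"total entropy: {entropy_bits:.4f} bits")
```

Note that the dominant token contributes little to the total, because its outcome is so unsurprising; most of the remaining uncertainty comes from the rare alternatives.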
Scenario 2: Predicting the next word after “I think the best pizza topping is ___”
The model might predict:
- P(pepperoni) = 0.15
- P(mushroom) = 0.15
- P(cheese) = 0.12
- P(olive) = 0.10
- … (many other toppings)
This distribution is more uniform, so entropy is higher (maybe 2.5+ bits). The model is less confident because opinions vary.
Why Entropy Matters
1. In Machine Learning
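- Training: Cross-entropy loss measures how surprised the model is by the true label, averaged over the training data. Minimizing it pushes the model’s predicted distribution toward the true one. Its lowest possible value is the entropy of the true distribution.
- Sampling: Higher-entropy distributions produce more random samples. Sampling temperature controls this: raising the temperature flattens the distribution and increases its entropy, while lowering it sharpens the distribution and decreases its entropy.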
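The link between temperature and entropy can be seen by applying a temperature-scaled softmax to a fixed set of logits. This is a minimal sketch: the logit values are made up for illustration, and `softmax` and `entropy_bits` are hypothetical helper names.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_bits(probs):
    """Shannon entropy in bits, skipping zero-probability outcomes."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

logits = [4.0, 2.0, 1.0, 0.5]  # hypothetical scores for four tokens
temperatures = (0.5, 1.0, 2.0, 5.0)
entropies = [entropy_bits(softmax(logits, t)) for t in temperatures]

for t, h in zip(temperatures, entropies):
    print(f"T = {t:>3}: entropy = {h:.3f} bits")
```

As the temperature grows, the distribution approaches uniform and the entropy approaches its maximum of $\log_2 4 = 2$ bits.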
2. In Information Theory
- Data compression: Entropy is the theoretical minimum average number of bits per symbol needed to losslessly encode messages from a source (Shannon’s source coding theorem).
- Communication: Higher entropy means more information per symbol.
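The compression claim can be illustrated with Huffman coding, a standard prefix code whose average codeword length $L$ satisfies $H \le L < H + 1$. Huffman coding is brought in here purely as an example and is not discussed in the text; the helper names and the first example distribution are assumptions.

```python
import heapq
import math

def huffman_code_lengths(probs):
    """Return the Huffman codeword length for each symbol (needs at least two symbols)."""
    # Heap entries: (probability, tie_breaker, indices of symbols in this subtree).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    counter = len(probs)  # unique tie-breaker so the lists are never compared
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for i in s1 + s2:
            lengths[i] += 1  # each merge adds one bit to these symbols' codewords
        heapq.heappush(heap, (p1 + p2, counter, s1 + s2))
        counter += 1
    return lengths

def entropy_bits(probs):
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

def average_length(probs):
    return sum(p * n for p, n in zip(probs, huffman_code_lengths(probs)))

skewed = [0.5, 0.25, 0.125, 0.125]  # hypothetical four-symbol source
die = [1 / 6] * 6                   # the fair die from the worked example

for name, probs in (("skewed source", skewed), ("fair die", die)):
    print(f"{name}: entropy = {entropy_bits(probs):.3f} bits, "
          f"Huffman average = {average_length(probs):.3f} bits/symbol")
```

When every probability is a power of 1/2, as in the skewed source, the Huffman code meets the entropy exactly. For the fair die it needs slightly more than $\log_2 6$ bits per roll, because codewords must have whole-number lengths.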
3. In Reinforcement Learning (Paper 15 Context)
- Policy entropy: In policy gradient RL, entropy encourages exploration. High entropy = exploring many actions. Low entropy = committing to a few actions.
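- KL divergence: Measures how much one probability distribution differs from another (not the difference between two entropy values). It equals the cross-entropy between the two distributions minus the entropy of the first.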
Entropy vs. KL Divergence
Entropy measures uncertainty within a single distribution.
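KL divergence measures the difference between two distributions. $D_{\mathrm{KL}}(P \,\|\, Q)$ is the average number of extra bits needed to encode outcomes drawn from $P$ when using a code optimized for $Q$ instead of one optimized for $P$. Equivalently, it is the cross-entropy of $P$ relative to $Q$ minus the entropy $H(P)$. It is not symmetric, so it is not a true distance.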
(KL divergence has its own tutorial: KL Divergence)
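The relationship between the three quantities, $D_{\mathrm{KL}}(P \,\|\, Q) = H(P, Q) - H(P)$, can be verified numerically. The sketch below reuses the fair coin as $P$ and the 99%-heads coin as $Q$; the function names are illustrative, and the code assumes $Q$ assigns non-zero probability wherever $P$ does.

```python
import math

def entropy_bits(p):
    """H(P): average bits needed with a code optimized for P."""
    return sum(pi * math.log2(1.0 / pi) for pi in p if pi > 0)

def cross_entropy_bits(p, q):
    """H(P, Q): average bits when outcomes follow P but the code is optimized for Q."""
    return sum(pi * math.log2(1.0 / qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence_bits(p, q):
    """D_KL(P || Q): extra bits paid for using Q's code on data from P."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

fair = [0.5, 0.5]      # P: the fair coin
biased = [0.99, 0.01]  # Q: the 99%-heads coin

print(f"H(P)       = {entropy_bits(fair):.4f} bits")
print(f"H(P, Q)    = {cross_entropy_bits(fair, biased):.4f} bits")
print(f"KL(P || Q) = {kl_divergence_bits(fair, biased):.4f} bits")
print(f"KL(Q || P) = {kl_divergence_bits(biased, fair):.4f} bits")
```

The two KL values differ, which shows the asymmetry: swapping the roles of the distributions changes the answer.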
Summary
| Concept | Formula | Interpretation |
|---|---|---|
| Entropy | $H(X) = -\sum P(x) \log P(x)$ | Uncertainty/randomness in a distribution |
| Maximum Entropy | $H_{\max} = \log n$ (for $n$ outcomes) | Perfectly uniform distribution |
| Zero Entropy | $H = 0$ | Completely certain (one outcome has probability 1) |
| Measured in | Bits (base-2), Nats (base-e) | Information content units |
Key Takeaways
- Entropy measures uncertainty. High entropy = unpredictable outcomes. Low entropy = predictable outcomes.
- Fair coin = 1 bit. Fair die with 6 faces = 2.585 bits.
- Biased distributions have lower entropy. The more skewed toward one outcome, the lower the entropy.
- In machine learning, entropy is used in cross-entropy loss (Paper 12, 15) and in RL exploration (Paper 15).
- Entropy ≥ 0, with equality only when the outcome is certain.
Further Reading
- Information Theory: A Tutorial Introduction by James V. Stone
- The Intuition Behind Shannon Entropy by Chris Olah
- Entropy Explainer with Visuals
Used in:
- Paper 15: Training Language Models to Follow Instructions with Human Feedback — KL divergence penalty uses entropy concepts
- [Paper 22: Diffusion Models] — Entropy in noise scheduling
- [Paper 24: Language Model Alignment] — Entropy in policy learning