Entropy

🟡 intermediate

Entropy: Measuring Uncertainty

What is Entropy?

Entropy is a measure of uncertainty, randomness, or lack of information in a probability distribution. The more uncertain you are about the outcome of an event, the higher the entropy.

Think of it this way: if you flip a fair coin (50% heads, 50% tails), the outcome is very uncertain—entropy is high. But if you flip a biased coin that always lands heads, the outcome is certain—entropy is zero because you know exactly what will happen.

Formal definition:

$$H(X) = -\sum_{x} P(x) \log_2 P(x)$$

Where:

  • $H(X)$ is the entropy of random variable $X$
  • $P(x)$ is the probability of outcome $x$
  • $\log_2$ is the logarithm base 2, so entropy is measured in “bits”
  • The sum is over all possible outcomes, with the convention $0 \log_2 0 = 0$ so that impossible outcomes contribute nothing
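
The definition translates almost directly into code. The following is a minimal Python sketch, not a reference implementation: the function name `entropy_bits` and the example distribution are illustrative assumptions, and zero-probability outcomes are skipped so they contribute nothing to the sum.

```python
import math

def entropy_bits(probs):
    """Shannon entropy, in bits, of a discrete distribution given as probabilities."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Skip zero-probability outcomes: by convention 0 * log2(0) counts as 0.
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Illustrative three-outcome distribution (not from the text above).
print(entropy_bits([0.5, 0.25, 0.25]))  # 1.5
```

The input check is a design choice for the sketch: it catches lists that are not valid distributions rather than silently returning a meaningless number.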

Intuition: Information Content

When you see something unexpected, it gives you more information than seeing something expected.

Example: You’re watching a weather forecast.

  • It is monsoon season, when it rains on 99% of days, and the announcer says: “It will rain tomorrow.”
    → You expected rain, so this tells you almost nothing new. Low information.

  • It is the dry season, when it rains on only 1% of days, and the announcer says: “It will rain tomorrow.”
    → This is surprising! It tells you a lot. High information.

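Information theory quantifies this: an outcome with probability $P(x)$ carries $-\log_2 P(x)$ bits of information, so low-probability outcomes are more informative than high-probability ones. Entropy is the average of this quantity over all outcomes, weighted by their probabilities.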
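
To make the forecast intuition concrete, the sketch below computes the information content of a single outcome, $-\log_2 P(x)$ (often called surprisal), for an expected event and a rare one. The helper name `surprisal_bits` is an assumption made for illustration; the probabilities 0.99 and 0.01 are taken from the example above.

```python
import math

def surprisal_bits(p):
    """Information content, in bits, of observing an outcome that has probability p."""
    return -math.log2(p)

p_expected, p_rare = 0.99, 0.01

print(round(surprisal_bits(p_expected), 4))  # 0.0145 (expected outcome, little information)
print(round(surprisal_bits(p_rare), 4))      # 6.6439 (rare outcome, a lot of information)

# Entropy is the probability-weighted average of the surprisals.
entropy = p_expected * surprisal_bits(p_expected) + p_rare * surprisal_bits(p_rare)
print(round(entropy, 4))  # 0.0808
```

The rare outcome carries far more information, but it happens so seldom that the average stays small.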

Worked Example: Fair Coin Flip

A fair coin has two outcomes: heads or tails, each with probability 0.5.

Calculate entropy:

$$H(X) = -[P(\text{heads}) \log_2 P(\text{heads}) + P(\text{tails}) \log_2 P(\text{tails})]$$

$$H(X) = -[0.5 \log_2 0.5 + 0.5 \log_2 0.5]$$

Now, $\log_2 0.5 = \log_2 (1/2) = -1$

$$H(X) = -[0.5 \times (-1) + 0.5 \times (-1)]$$

$$H(X) = -[-0.5 - 0.5] = -[-1] = 1 \text{ bit}$$

Interpretation: A fair coin flip has entropy of 1 bit. This makes sense: you need 1 bit of information to specify the outcome (0 for heads, 1 for tails).

Worked Example: Biased Coin (99% Heads)

A biased coin: P(heads) = 0.99, P(tails) = 0.01

Calculate entropy:

$$H(X) = -[0.99 \log_2 0.99 + 0.01 \log_2 0.01]$$

Calculate each term:

  • $\log_2 0.99 \approx -0.0145$
  • $\log_2 0.01 \approx -6.644$

$$H(X) = -[0.99 \times (-0.0145) + 0.01 \times (-6.644)]$$

$$H(X) = -[-0.01436 - 0.06644]$$

$$H(X) = -[-0.0808] = 0.0808 \text{ bits}$$

Interpretation: This biased coin has an entropy of about 0.08 bits, much lower than the fair coin’s 1 bit, because we are very confident it will land heads. The outcome is rarely surprising.
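
The two coin examples are special cases of the entropy of a two-outcome distribution viewed as a function of the bias. The sketch below, with the assumed helper name `binary_entropy`, evaluates it at a few values of P(heads) to show how entropy falls as the coin becomes more biased.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a coin that lands heads with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # a certain outcome carries no uncertainty
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

for p in (0.5, 0.9, 0.99, 1.0):
    print(p, round(binary_entropy(p), 4))
# 0.5 1.0
# 0.9 0.469
# 0.99 0.0808
# 1.0 0.0
```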

Worked Example: A Six-Sided Die

A fair die with 6 outcomes, each with probability 1/6.

Calculate entropy:

$$H(X) = -\sum_{i=1}^{6} \frac{1}{6} \log_2 \frac{1}{6}$$

$$H(X) = -6 \times \frac{1}{6} \log_2 \frac{1}{6}$$

$$H(X) = -\log_2 \frac{1}{6} = -\log_2(6^{-1}) = \log_2 6$$

$$\log_2 6 = \log_2(2 \times 3) = 1 + \log_2 3 \approx 1 + 1.585 = 2.585 \text{ bits}$$

Interpretation: A fair die has entropy of ~2.585 bits. Higher than a coin (1 bit) because there are more possible outcomes.

Key Properties

1. Entropy is Always Non-Negative

$$H(X) \geq 0$$

Entropy equals 0 when the distribution is certain (one outcome has probability 1, all others 0).

Entropy is maximum when all outcomes are equally likely.

2. Maximum Entropy

For a distribution with $n$ possible outcomes, maximum entropy is:

$$H_{\max} = \log_2 n$$

Example: A die with 6 faces has maximum entropy of $\log_2 6 \approx 2.585$ bits.
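
A quick numerical check of this bound, as a sketch: the fair die reaches $\log_2 6$, while a skewed die falls below it. The loaded-die probabilities are hypothetical values chosen only for illustration.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability outcomes are skipped."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

n = 6
h_max = math.log2(n)  # upper bound for any distribution over 6 outcomes

fair_die = [1 / n] * n                       # uniform: reaches the bound
loaded_die = [0.5, 0.1, 0.1, 0.1, 0.1, 0.1]  # hypothetical loaded die

for name, dist in [("fair", fair_die), ("loaded", loaded_die)]:
    print(f"{name}: {entropy_bits(dist):.3f} bits (max {h_max:.3f})")
```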

3. Entropy is Measured in Bits (if base-2 log)

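If you use the natural log (base $e$), entropy is measured in “nats”. If you use base 10, it is measured in “dits” (also called hartleys). Information theory typically uses base 2 for the intuitive “bits” interpretation, while machine learning libraries usually compute losses with the natural log. The choice of base only rescales entropy by a constant factor.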
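
Because changing the base only rescales the result, units convert with a constant factor: nats = bits × ln 2. The sketch below illustrates this for a fair coin; the function name `entropy` and its `base` parameter are assumptions of the example.

```python
import math

def entropy(probs, base=2.0):
    """Shannon entropy using logarithms in the given base."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

fair_coin = [0.5, 0.5]

bits = entropy(fair_coin, base=2)
nats = entropy(fair_coin, base=math.e)
dits = entropy(fair_coin, base=10)

print(round(bits, 4), round(nats, 4), round(dits, 4))  # 1.0 0.6931 0.301

# Same uncertainty, different units.
print(math.isclose(nats, bits * math.log(2)))  # True
```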

Real-World Example: Language Models

In language models, entropy tells you how “confident” the model is about the next token.

Scenario 1: Predicting the next word in “The capital of France is ___”

The model might predict:

  • P(Paris) = 0.95
  • P(Lyon) = 0.02
  • P(London) = 0.01
  • P(other) = 0.02

$$H \approx -[0.95 \log_2 0.95 + 0.02 \log_2 0.02 + 0.01 \log_2 0.01 + 0.02 \log_2 0.02]$$

$$H \approx -[-0.070 - 0.113 - 0.066 - 0.113] = -[-0.362] \approx 0.362 \text{ bits}$$

Low entropy because the model is very confident in “Paris”.

Scenario 2: Predicting the next word after “I think the best pizza topping is ___”

The model might predict:

  • P(pepperoni) = 0.15
  • P(mushroom) = 0.15
  • P(cheese) = 0.12
  • P(olive) = 0.10
  • … (many other toppings)

This distribution is much flatter, so entropy is higher (likely 2.5 bits or more, depending on how the remaining probability is spread across the other toppings). The model is less confident because opinions vary.

Why Entropy Matters

1. In Machine Learning

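  • Training: Cross-entropy loss measures how surprised the model is by the true label: it averages the negative log-probability the model assigns to the correct answer. Minimizing it pulls the model’s predicted distribution toward the true one; it does not simply reduce the entropy of the predictions.
  • Sampling: Higher-entropy distributions are more random. Sampling temperature controls entropy: raising the temperature flattens the distribution and increases entropy, while lowering it sharpens the distribution and decreases entropy.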
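
The link between temperature and entropy can be sketched with a temperature-scaled softmax. The logits below are hypothetical next-token scores, and `softmax` and `entropy_bits` are helper names assumed for this example rather than any particular library's API.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert scores to probabilities; higher temperature flattens the result."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability outcomes are skipped."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

logits = [4.0, 1.0, 0.5, 0.0]  # hypothetical scores for four candidate tokens

for t in (0.5, 1.0, 2.0):
    probs = softmax(logits, temperature=t)
    print(f"T={t}: entropy = {entropy_bits(probs):.3f} bits")
```

Running it should show entropy rising as the temperature goes up, which is why high-temperature sampling produces more varied output.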

2. In Information Theory

  • Data compression: Entropy is the theoretical minimum average number of bits per symbol needed to encode messages from a source losslessly (Shannon’s source coding theorem).
  • Communication: Higher entropy means more information per symbol.
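
As an illustration of the compression bound, the sketch below compares a fixed-length code with a variable-length prefix code for a hypothetical four-symbol source. The symbols, probabilities, and codewords are assumptions chosen so that the arithmetic is exact.

```python
import math

# Hypothetical source: symbol -> probability.
probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

# A prefix code that gives shorter codewords to more likely symbols.
code = {"a": "0", "b": "10", "c": "110", "d": "111"}

entropy_bits_per_symbol = -sum(p * math.log2(p) for p in probs.values())
avg_code_length = sum(probs[s] * len(code[s]) for s in probs)
fixed_length = math.ceil(math.log2(len(probs)))  # bits per symbol with equal-length codewords

print(entropy_bits_per_symbol)  # 1.75
print(avg_code_length)          # 1.75
print(fixed_length)             # 2
```

Here the variable-length code meets the entropy exactly because every probability is a power of 1/2. For other distributions the best achievable average length is at least the entropy, and usually slightly above it.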

3. In Reinforcement Learning (Paper 15 Context)

  • Policy entropy: In policy gradient RL, entropy encourages exploration. High entropy = exploring many actions. Low entropy = committing to a few actions.
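  • KL divergence: Measures how much one probability distribution differs from another. In RL it is commonly used to keep an updated policy close to a reference policy. It is not symmetric, so it is not a true distance.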

Entropy vs. KL Divergence

Entropy measures uncertainty within a single distribution.

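KL divergence measures the difference between two distributions. If entropy is the average number of bits needed to encode outcomes of $P$ with a code built for $P$, then $D_{KL}(P \| Q)$ is the average number of extra bits needed when the code was built for a different distribution $Q$ instead.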
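
The relationship between the two quantities can be checked numerically: cross-entropy equals entropy plus KL divergence, $H(P, Q) = H(P) + D_{KL}(P \| Q)$. The sketch below uses two illustrative coin distributions; the function names are assumptions of the example, and it assumes $Q$ gives non-zero probability to every outcome that $P$ can produce.

```python
import math

def entropy_bits(p):
    """Entropy H(P) in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy_bits(p, q):
    """Cross-entropy H(P, Q): average bits when outcomes of P are coded for Q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_bits(p, q):
    """KL divergence D_KL(P || Q): the extra bits paid for using Q instead of P."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]  # fair coin
q = [0.9, 0.1]  # biased coin

print(round(kl_bits(p, q), 3))  # 0.737
print(round(kl_bits(q, p), 3))  # 0.531 (swapping the arguments changes the value)
print(math.isclose(cross_entropy_bits(p, q), entropy_bits(p) + kl_bits(p, q)))  # True
```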

(KL divergence has its own tutorial: KL Divergence)

Summary

| Concept | Formula | Interpretation |
|---|---|---|
| Entropy | $H(X) = -\sum P(x) \log P(x)$ | Uncertainty/randomness in a distribution |
| Maximum Entropy | $H_{\max} = \log n$ (for $n$ outcomes) | Perfectly uniform distribution |
| Zero Entropy | $H = 0$ | Completely certain (one outcome has probability 1) |
| Measured in | Bits (base-2), Nats (base-e) | Information content units |

Key Takeaways

  1. Entropy measures uncertainty. High entropy = surprising outcomes. Low entropy = predictable outcomes.
  2. Fair coin = 1 bit. Fair die with 6 faces = 2.585 bits.
  3. Biased distributions have lower entropy. The more skewed toward one outcome, the lower the entropy.
  4. In machine learning, entropy is used in cross-entropy loss (Paper 12, 15) and in RL exploration (Paper 15).
  5. Entropy ≥ 0, with equality only when the outcome is certain.
