Cross-Entropy Loss
1. What is this and why do we care?
Every time a language model makes a mistake — predicts “banana” when the correct next word was “chai” — something needs to measure how wrong it was. That measure is the cross-entropy loss.
Cross-entropy loss is the standard training signal for nearly every modern neural language model. It is what GPT, BERT, LLaMA, and Claude were trained to minimise. It converts the gap between what the model predicts and what actually happened into a single number. Gradient descent then nudges the model’s weights to make that number smaller.
Without cross-entropy loss (or a close relative of it), training a neural network on language is not practical. It is the heartbeat of every paper in this series from Paper 09 onwards.
2. Prerequisites
You need Probability Distributions and Softmax Function. The core idea builds directly on those.
3. The intuition — before any symbols
You are a cricket commentator guessing the outcome of the next ball. You assign probabilities to three outcomes:
Your prediction:
Six → 0.10 (10% chance)
Four → 0.20 (20% chance)
Dot → 0.70 (70% chance)
Then the batsman hits a six. How wrong were you?
Intuitively: very wrong. You gave the six only a 10% chance, but it happened. Your prediction was confident in the wrong direction.
Now imagine you had predicted:
Better prediction:
Six → 0.70 (70% chance)
Four → 0.20 (20% chance)
Dot → 0.10 (10% chance)
The six happens again. Now you were barely wrong — you were 70% confident in the right outcome.
Cross-entropy loss captures this intuition with a single formula (log here means the natural logarithm):
Loss = −log( probability assigned to the correct outcome )
- Bad prediction (10% on correct): Loss = −log(0.10) = 2.303 (high)
- Good prediction (70% on correct): Loss = −log(0.70) = 0.357 (low)
- Perfect prediction (100% on correct): Loss = −log(1.00) = 0.000 (zero)
The higher the probability you assigned to what actually happened, the lower your loss. Training pushes probabilities of correct outcomes up — and loss down.
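The formula above is short enough to check by hand, but a minimal Python sketch makes it easy to try other probabilities. The function name `loss_for` is an illustrative choice, not part of any library.

```python
import math

def loss_for(p_correct: float) -> float:
    """Cross-entropy loss for one prediction: minus the natural log of p_correct."""
    return -math.log(p_correct)

# The bad and good predictions from the cricket example.
for label, p in [("bad", 0.10), ("good", 0.70)]:
    print(f"{label:5s} p_correct={p:.2f}  loss={loss_for(p):.3f}")

# A perfect prediction (p_correct = 1.0) gives a loss of zero.
```

Trying values such as 0.01 or 0.99 shows how steeply the loss rises as the probability on the correct outcome shrinks.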
4. A tiny worked example with real numbers
A language model is predicting the next word after “subah uthke”. The true next word is “chai”. The model’s softmax output (a categorical distribution over 4 words) is:
"chai" → 0.40
"paani" → 0.30
"dudh" → 0.20
"coffee" → 0.10
The model gives “chai” (the correct word) a probability of 0.40.
Cross-entropy loss:
L = −log(0.40) = −(−0.916) = 0.916
Now suppose after one training step, the model improves:
"chai" → 0.75
"paani" → 0.15
"dudh" → 0.07
"coffee" → 0.03
New loss:
L = −log(0.75) = −(−0.288) = 0.288
The loss dropped from 0.916 to 0.288. The model is getting better. This is the number that gradient descent minimises.
Note: we only need the probability on the correct word. The other probabilities (paani, dudh, coffee) affect this indirectly — they must all be smaller if chai’s probability is to be large, since all probabilities sum to 1.
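The two losses in this example can be reproduced in a few lines of Python. This is only a sketch: the dictionaries `before` and `after` and the helper `cross_entropy` are hypothetical names that hold the numbers from this section.

```python
import math

def cross_entropy(probs: dict, correct: str) -> float:
    """Loss for one prediction: -log of the probability given to the correct word."""
    total = sum(probs.values())
    # A softmax output must sum to 1 (up to floating-point rounding).
    assert abs(total - 1.0) < 1e-9, "probabilities must sum to 1"
    return -math.log(probs[correct])

before = {"chai": 0.40, "paani": 0.30, "dudh": 0.20, "coffee": 0.10}
after = {"chai": 0.75, "paani": 0.15, "dudh": 0.07, "coffee": 0.03}

print(round(cross_entropy(before, "chai"), 3))  # 0.916
print(round(cross_entropy(after, "chai"), 3))   # 0.288
```

Only the entry for the correct word enters the loss; the other entries matter through the sum-to-1 constraint that the assertion checks.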
5. The general formula
For a single prediction with K possible classes, let y be the one-hot vector for the true class (a 1 at the position of the true class and 0 elsewhere), and let p be the model’s predicted probability vector:
L = − Σᵢ yᵢ · log(pᵢ)
Since y is one-hot (only one yᵢ = 1, all others = 0), this simplifies to:
L = −log( p_correct )
Where p_correct is the probability the model assigned to the correct class.
For a training set of N examples:
L_total = −(1/N) · Σₙ log( p_correct(n) )
This is the average cross-entropy loss (also called negative log-likelihood) over the dataset.
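As a rough sketch, the three formulas in this section translate to Python as follows. The function names `cross_entropy_full` and `average_loss` are made up for illustration, and the numbers reuse the worked example from section 4.

```python
import math

def cross_entropy_full(y, p):
    """L = -sum_i y_i * log(p_i), skipping terms where y_i is 0."""
    return -sum(y_i * math.log(p_i) for y_i, p_i in zip(y, p) if y_i > 0)

def average_loss(p_correct_values):
    """L_total = -(1/N) * sum_n log(p_correct(n))."""
    n = len(p_correct_values)
    return -sum(math.log(p) for p in p_correct_values) / n

y = [1, 0, 0, 0]              # one-hot: the first class is correct
p = [0.40, 0.30, 0.20, 0.10]  # predicted distribution

full = cross_entropy_full(y, p)   # the sum over all K classes
short = -math.log(p[0])           # the simplified form, -log(p_correct)
print(math.isclose(full, short))  # True

# Average over a tiny "dataset" of two predictions.
print(round(average_loss([0.40, 0.75]), 3))  # 0.602
```

The one-hot sum and the simplified form agree because every term with yᵢ = 0 contributes nothing.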
6. The log connection: why logarithm?
Why use logarithm instead of just looking at p_correct directly?
Reason 1: Multiplicative errors become additive. The probability of a correct sequence of T words is p₁ × p₂ × … × p_T. These multiplications make the number astronomically small for long sequences (0.5¹⁰⁰ ≈ 10⁻³⁰). Taking the log converts products to sums: log(p₁ × p₂ × … × p_T) = log(p₁) + log(p₂) + … + log(p_T). Sums are numerically stable; tiny products are not.
Reason 2: Logarithms penalise overconfidence more strongly. Compare:
p_correct = 0.90 → Loss = −log(0.90) = 0.105 (small penalty, almost right)
p_correct = 0.50 → Loss = −log(0.50) = 0.693 (moderate)
p_correct = 0.10 → Loss = −log(0.10) = 2.303 (harsh penalty)
p_correct = 0.01 → Loss = −log(0.01) = 4.605 (very harsh)
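The loss grows rapidly as p_correct drops toward zero, and it has no upper bound: −log(p) → ∞ as p → 0. A model that puts only 10% on the correct answer (because it is confident in a wrong one) pays 2.303, more than three times the 0.693 it pays for a 50/50 guess. This is the right behaviour for training: the model should not be falsely confident.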
Reason 3: Maximum likelihood estimation. Minimising cross-entropy loss is equivalent to maximum likelihood estimation — finding the model parameters that make the training data most probable. This is a well-grounded probabilistic foundation. Cross-entropy is not an arbitrary choice; it is the mathematically natural measure for comparing probability distributions.
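Reason 1 is easy to see in practice. The sketch below assumes Python 3.8+ (for `math.prod`) and ordinary 64-bit floats; the sequence length of 2000 and the per-token probability of 0.5 are arbitrary illustrative values.

```python
import math

probs = [0.5] * 2000  # hypothetical per-token probabilities for a long sequence

product = math.prod(probs)                  # underflows to exactly 0.0 in float64
log_sum = sum(math.log(p) for p in probs)   # stays an ordinary, usable number

print(product)
print(log_sum)
print(-log_sum / len(probs))  # average loss per token, in nats
```

The raw product carries no information once it underflows, while the sum of logs still distinguishes a good sequence probability from a bad one.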
7. Cross-entropy vs KL divergence
Cross-entropy is closely related to KL divergence (Kullback-Leibler divergence), a general measure of how different two distributions are:
KL(P || Q) = Σᵢ P(i) · log( P(i) / Q(i) )
= Σᵢ P(i) · log(P(i)) − Σᵢ P(i) · log(Q(i))
= −H(P) + CrossEntropy(P, Q)
Where H(P) is the entropy of the true distribution P.
When the true labels are one-hot (a single correct answer, no uncertainty in the ground truth), H(P) = 0, so:
KL divergence = CrossEntropy (when true labels are one-hot)
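This is why minimising cross-entropy loss and minimising KL divergence are the same thing during language model training. More generally, H(P) does not depend on the model, so the two objectives differ only by a constant and share the same minimiser even when the labels are not one-hot. You will see both terms used interchangeably in papers.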
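The identity KL = CrossEntropy − H(P) can be checked numerically. In this sketch the function names are illustrative, the distribution `soft` is made up, and terms with P(i) = 0 are skipped, following the usual convention that 0 · log(0) counts as 0.

```python
import math

def entropy(p):
    """H(P) = -sum_i P(i) log P(i), with 0 * log(0) treated as 0."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """CrossEntropy(P, Q) = -sum_i P(i) log Q(i)."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """KL(P || Q) = sum_i P(i) log(P(i) / Q(i))."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

q = [0.40, 0.30, 0.20, 0.10]     # model prediction
soft = [0.70, 0.10, 0.10, 0.10]  # a non-one-hot "true" distribution (made up)
one_hot = [1.0, 0.0, 0.0, 0.0]   # a single correct answer

# General identity: KL = CrossEntropy - H
print(math.isclose(kl_divergence(soft, q), cross_entropy(soft, q) - entropy(soft)))

# One-hot labels: H(P) is zero, so KL and cross-entropy coincide.
print(entropy(one_hot) == 0, math.isclose(kl_divergence(one_hot, q), cross_entropy(one_hot, q)))
```

Both lines print `True`, once for the general identity and once for the one-hot special case.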
8. Perplexity: the reported metric
In language modelling papers, you rarely see “cross-entropy loss” in the results tables. Instead, you see perplexity:
Perplexity = exp( L ) = e^(average cross-entropy loss)
Example: If a language model achieves average loss of 3.0 on a test set:
Perplexity = e^3.0 ≈ 20.1
Interpretation: the model is, on average, as uncertain as if it were choosing uniformly among 20 words at every step. Lower perplexity = better model.
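Perplexity is only comparable between models evaluated on the same dataset with the same tokenisation. As one example, on Penn Treebank the GPT-2 paper reports a zero-shot perplexity of 35.76 for its largest model, and the GPT-3 paper reports 20.5. Better models have lower perplexity: they assign higher probability to what actually came next.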
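A small Python sketch of the conversion, assuming the loss is measured in nats. The helper name `perplexity` is illustrative.

```python
import math

def perplexity(avg_loss_nats: float) -> float:
    """Perplexity = exp(average cross-entropy loss in nats)."""
    return math.exp(avg_loss_nats)

print(round(perplexity(3.0), 1))  # 20.1

# Sanity check of the interpretation: a model that spreads probability
# uniformly over 20 words has loss log(20) at every step, so perplexity 20.
uniform_loss = -math.log(1 / 20)
print(round(perplexity(uniform_loss), 6))  # 20.0
```

The second check is where the "choosing uniformly among N words" reading of perplexity comes from.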
9. Where cross-entropy appears in the papers
Paper 09 — Mixture of Experts: The MoE layer is trained end-to-end with cross-entropy loss on language modelling. A separate auxiliary loss, added to the cross-entropy term, encourages expert load balancing.
Papers 10–12 — GPT-1, BERT, GPT-3: All pre-trained by minimising cross-entropy loss on large text corpora. For GPT models, the task is “predict the next token.” For BERT, it is “predict the masked tokens.” Cross-entropy is the training objective in both cases.
Paper 14 — Chain-of-Thought: CoT fine-tuning uses cross-entropy loss on reasoning traces, teaching models to produce intermediate steps with high probability.
Paper 16 — Process Reward Models: Instead of one cross-entropy loss for the final answer, models are trained with cross-entropy on step-by-step correctness labels.
Paper 17 — LLaMA: Trained with cross-entropy loss using causal language modelling. The paper tracks training loss as a function of tokens seen and evaluates the models mainly on downstream benchmarks.
10. Common mistakes
- Using cross-entropy when you don’t have probabilities. The formula requires p values between 0 and 1 that sum to 1. If your model outputs raw logits (scores, not probabilities), you must pass them through softmax first. Many libraries (like PyTorch’s nn.CrossEntropyLoss) apply the softmax internally and expect raw logits, so passing already-softmaxed probabilities into such functions will give wrong results.
- Confusing the base of the logarithm. Cross-entropy in information theory traditionally uses log base 2 (giving values in bits). In machine learning, we almost always use the natural logarithm (base e, giving values in nats). Values in different bases are not directly comparable: divide nats by ln 2 ≈ 0.693 to get bits. Most ML papers use natural log, though some report bits per character or bits per byte.
- Forgetting the negative sign. log(p) for p ∈ (0,1) is always negative (the log of a number below 1 is negative). The negative sign in −log(p) makes the loss positive. Forgetting it gives a negative loss, which is nonsensical.
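The first two mistakes can be illustrated with a standard-library sketch. It is not PyTorch’s implementation; it only mimics the behaviour of a loss function that takes raw logits and applies softmax internally. The function names and the logits are hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax using the log-sum-exp trick."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum_exp for z in logits]

def cross_entropy_from_logits(logits, correct_index):
    """Loss in nats, taking raw logits (softmax is applied inside)."""
    return -log_softmax(logits)[correct_index]

logits = [2.0, 1.0, 0.1]  # hypothetical raw scores
correct = 0

right = cross_entropy_from_logits(logits, correct)

# Mistake: apply softmax first, then hand the probabilities to a function
# that applies softmax again. The resulting loss is different (here, larger).
probs = [math.exp(v) for v in log_softmax(logits)]
wrong = cross_entropy_from_logits(probs, correct)

print(round(right, 3), round(wrong, 3))
print(round(right / math.log(2), 3))  # the same correct loss expressed in bits
```

The double softmax flattens the distribution, which is why the second number no longer reflects what the model actually predicted.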
11. Try it yourself
Exercise 1: A model predicts [0.6, 0.3, 0.1] for three words [“chai”, “paani”, “dudh”]. The correct word is “paani”. Compute the cross-entropy loss.
Answer:
The correct word is “paani” (the second entry), probability = 0.3.
L = −log(0.3) = −(−1.204) = 1.204
Note: if the correct word had been “chai” (probability 0.6), the loss would have been −log(0.6) = 0.511 (lower). The model is being penalised for under-predicting “paani”.
Exercise 2: Model A predicts [0.90, 0.05, 0.05] and Model B predicts [0.40, 0.30, 0.30]. The correct class is class 1 (the first one). Which model has lower cross-entropy loss? By how much?
Answer:
Model A: L = −log(0.90) = 0.105
Model B: L = −log(0.40) = 0.916
Model A has much lower loss (0.105 vs 0.916) because it assigned 90% probability to the correct class, while Model B only assigned 40%. The gap is 0.916 − 0.105 = 0.811 nats; Model A’s loss is roughly 8.7× smaller.
Exercise 3: A language model achieves average cross-entropy loss of 2.5 nats on a test set. What is its perplexity? If a competing model has perplexity 9.0, which model is better?
Answer:
Perplexity = e^2.5 ≈ 12.18
The competing model with perplexity 9.0 is better (lower perplexity = lower average loss = model assigns higher probability to correct tokens).
In terms of loss: Loss = log(9.0) ≈ 2.197 nats. So the competing model achieves 2.197 nats vs our model’s 2.5 nats.