Softmax Function

Intermediate
Before this tutorial: /math-tutorials/linear-algebra/vectors-introduction, /math-tutorials/probability/probability-basics

1. What is this and why do we care?

If you learn only one function from all of modern AI, make it softmax.

Every time a language model predicts the next word (every time ChatGPT, Claude, or Gemini generates a response), softmax is the step that turns the model's final raw scores into probabilities over the vocabulary, from which the next word is then chosen. The attention weights in the Transformer architecture that powers these models are computed with softmax. Most neural-network classifiers that choose among several classes end with softmax as well.

Softmax does one thing: it takes a list of raw numbers (which could be negative, very large, very small — anything) and converts them into a proper probability distribution. That means all outputs are between 0 and 1, and they all add up to exactly 1.

Without softmax, a neural network would produce only raw scores. It could still pick the largest one, but it would have no way to say “I am 80% confident this is a cat.”


2. Prerequisites

You need to be comfortable with vectors and know what a probability is. If you have not read Vectors — Introduction and Probability Basics, go through those first.


3. The intuition — before any symbols

Imagine a cricket match. The commentators on five different channels are rating India’s chance of winning with raw confidence scores (not percentages, just gut-feeling numbers):

Channel 1:  8.2
Channel 2:  3.1
Channel 3:  9.5
Channel 4:  2.0
Channel 5:  6.7

These numbers are useful but they are not probabilities — they do not sum to 1 and you cannot compare them across different matches. You need to convert them into fair shares.

One simple idea: divide each score by the total.

Total = 8.2 + 3.1 + 9.5 + 2.0 + 6.7 = 29.5

Channel 1 share: 8.2 / 29.5 = 0.278
Channel 2 share: 3.1 / 29.5 = 0.105
...

This works if all scores are positive. But what if someone gave a score of −4? You cannot take a negative share of something. Division by sum breaks down for negative numbers.

Softmax’s elegant fix: exponentiate first, then divide. The function eˣ (e to the power x) converts any number — positive, negative, zero — into a positive number. eˣ is always > 0, no matter what x is.

Once everything is positive, dividing by the sum works perfectly. That is the entire idea of softmax.


4. A tiny worked example with real numbers

Suppose a neural network produces three raw scores (called logits) for classifying an image as a cat, dog, or fish:

z = [2.0, 1.0, 0.1]
     cat   dog  fish

Step 1: Exponentiate each score.

e^2.0 = 7.389
e^1.0 = 2.718
e^0.1 = 1.105

Step 2: Sum the exponentials.

Sum = 7.389 + 2.718 + 1.105 = 11.212

Step 3: Divide each exponential by the sum.

softmax(cat)  = 7.389 / 11.212 = 0.659
softmax(dog)  = 2.718 / 11.212 = 0.242
softmax(fish) = 1.105 / 11.212 = 0.099

Verify: 0.659 + 0.242 + 0.099 = 1.000 ✓

Interpretation: The model is 65.9% confident this is a cat, 24.2% it is a dog, 9.9% a fish.
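The three steps above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the variable names (logits, exps, total, probs) are chosen here for readability and do not come from any particular framework.

```python
import math

# Raw scores (logits) from the worked example.
logits = {"cat": 2.0, "dog": 1.0, "fish": 0.1}

# Step 1: exponentiate each score.
exps = {label: math.exp(score) for label, score in logits.items()}

# Step 2: sum the exponentials.
total = sum(exps.values())

# Step 3: divide each exponential by the sum.
probs = {label: value / total for label, value in exps.items()}

for label, p in probs.items():
    print(f"{label}: {p:.3f}")
# cat: 0.659
# dog: 0.242
# fish: 0.099
```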


5. The general rule

Given a vector of n raw scores z = [z₁, z₂, …, zₙ]:

softmax(zᵢ) = exp(zᵢ) / (exp(z₁) + exp(z₂) + ... + exp(zₙ))

More compactly:

softmax(zᵢ) = exp(zᵢ) / Σⱼ exp(zⱼ)

Where:

  • exp(zᵢ) means e raised to the power zᵢ (also written eᶻⁱ)
  • Σⱼ means “sum over all j” — summing every term in the denominator
  • The result softmax(zᵢ) is always between 0 and 1
  • All results sum to exactly 1
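The last two bullets follow directly from the definition. A short derivation, written in LaTeX with the same symbols as above (the only assumption is that there are at least two scores, n ≥ 2, so that no single term equals the whole sum):

```latex
% Every exponential is positive and strictly smaller than the full sum (n >= 2):
\[
0 < e^{z_i} < \sum_{j=1}^{n} e^{z_j}
\quad\Longrightarrow\quad
0 < \operatorname{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}} < 1
\]

% Summing over i, the numerators add up to the shared denominator:
\[
\sum_{i=1}^{n} \operatorname{softmax}(z_i)
= \frac{\sum_{i=1}^{n} e^{z_i}}{\sum_{j=1}^{n} e^{z_j}}
= 1
\]
```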

6. Why exp? Why not just divide by the sum?

Two reasons. First, as mentioned: raw scores can be negative, and exp makes everything positive.

Second — and this is the deeper reason — exp amplifies differences. The highest score becomes disproportionately dominant. This is a feature, not a bug.

Watch what happens when you double all the scores (multiply by 2):

Original scores:    z = [2.0, 1.0, 0.1]
softmax output:         [0.659, 0.242, 0.099]

Doubled scores:     z = [4.0, 2.0, 0.2]
softmax output:         [0.864, 0.117, 0.019]

The winner (cat) went from 65.9% to 86.4%. The others collapsed. Softmax gets sharper as the gaps between scores grow, and doubling every score doubles every gap. In Transformers, this sharpness is controlled explicitly; more on that in the temperature discussion below.

Compare this to simple division by sum (no exp):

[2.0, 1.0, 0.1] / 3.1 = [0.645, 0.323, 0.032]

This looks reasonable but breaks for negatives. Try z = [2.0, −1.0, 0.1]: sum = 1.1, division gives [1.818, −0.909, 0.091] — negative probabilities! Nonsense. Exp avoids this entirely.
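The comparison in this section can be reproduced with a small sketch. The helper names divide_by_sum and softmax are made up for this illustration; the code uses only Python's standard library.

```python
import math

def divide_by_sum(scores):
    """Naive normalisation: each score divided by the total."""
    total = sum(scores)
    return [s / total for s in scores]

def softmax(scores):
    """Exponentiate first, then divide by the sum of the exponentials."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# With a negative score, plain division yields values outside [0, 1].
mixed = [2.0, -1.0, 0.1]
naive = divide_by_sum(mixed)
proper = softmax(mixed)

# Doubling every score widens the gaps, so the winner's share grows.
original = softmax([2.0, 1.0, 0.1])
doubled = softmax([4.0, 2.0, 0.2])

print("divide by sum:", [round(p, 3) for p in naive])
print("softmax:      ", [round(p, 3) for p in proper])
print("original:     ", [round(p, 3) for p in original])
print("doubled:      ", [round(p, 3) for p in doubled])
```

Running it shows a negative entry in the divide-by-sum result, while every softmax entry stays positive and the entries still sum to 1.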


7. A bigger example — language model vocabulary

A language model needs to pick the next word from a vocabulary of 5 words. It computes these logits:

Word:    "tea"   "chai"  "water"  "milk"  "juice"
Logit:    3.2     4.1     1.5      2.8     0.6

Apply softmax:

exp values:
  "tea":   e^3.2 = 24.53
  "chai":  e^4.1 = 60.34
  "water": e^1.5 =  4.48
  "milk":  e^2.8 = 16.44
  "juice": e^0.6 =  1.82

Sum = 24.53 + 60.34 + 4.48 + 16.44 + 1.82 = 107.61

softmax probabilities:
  "tea":   24.53 / 107.61 = 0.228
  "chai":  60.34 / 107.61 = 0.561
  "water":  4.48 / 107.61 = 0.042
  "milk":  16.44 / 107.61 = 0.153
  "juice":  1.82 / 107.61 = 0.017

Check: 0.228 + 0.561 + 0.042 + 0.153 + 0.017 = 1.001 ✓ (tiny rounding error)

“Chai” had the highest raw score (4.1) and wins convincingly (56.1%). Greedy decoding, which always takes the most probable word, would pick “chai.”
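As a rough sketch, the same calculation and the greedy pick look like this in Python. The five-word vocabulary and its logits are the toy values from this section; a real model scores tens of thousands of tokens, and the names probs and greedy_choice are illustrative only.

```python
import math

# Toy vocabulary and logits from the example above.
logits = {"tea": 3.2, "chai": 4.1, "water": 1.5, "milk": 2.8, "juice": 0.6}

def softmax(scores):
    """Map a {word: logit} dict to a {word: probability} dict."""
    exps = {word: math.exp(score) for word, score in scores.items()}
    total = sum(exps.values())
    return {word: value / total for word, value in exps.items()}

probs = softmax(logits)

# Greedy choice: take the word with the highest probability.
greedy_choice = max(probs, key=probs.get)

for word, p in probs.items():
    print(f"{word}: {p:.3f}")
print("greedy pick:", greedy_choice)
```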


8. Temperature: making softmax sharper or flatter

The “temperature” setting offered by chatbots and language-model APIs works through softmax: the logits are divided by T, the temperature, before softmax is applied.

softmax_with_temperature(zᵢ) = exp(zᵢ / T) / Σⱼ exp(zⱼ / T)

  • T = 1 (default): normal softmax
  • T < 1 (cold): probabilities become sharper; as T → 0, almost all the weight lands on the winner, giving more predictable, robotic responses
  • T > 1 (warm/hot): probabilities become flatter and more spread out, giving more creative, random, sometimes surprising responses

Example with the same logits [3.2, 4.1, 1.5, 2.8, 0.6]:

T = 0.5:  "chai" probability → ~0.80  (very confident, robotic)
T = 1.0:  "chai" probability → ~0.56  (normal)
T = 2.0:  "chai" probability → ~0.38  (more diverse, creative)

This is why AI chatbots have a temperature dial — it directly controls the shape of the softmax output.
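To experiment with the temperature formula, a minimal sketch might look like the following. The function name softmax_with_temperature mirrors the formula in this section and is not any library's API; subtracting the largest scaled logit before exponentiating is the standard overflow guard and does not change the result.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax applied to logits / temperature (temperature must be > 0)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # shift so the largest exponent is 0 (avoids overflow)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

words = ["tea", "chai", "water", "milk", "juice"]
logits = [3.2, 4.1, 1.5, 2.8, 0.6]

# Probability of "chai" at a cold, a default, and a warm temperature.
chai_by_temperature = {}
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    chai_by_temperature[t] = probs[words.index("chai")]
    print(f"T = {t}: chai = {chai_by_temperature[t]:.2f}")
```

The winner's probability falls as T rises, which is the flattening effect described above.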


9. Softmax in attention mechanisms

In Papers 07 and 08, softmax is used to compute attention weights. The idea:

  1. Compute a raw “relevance score” for each source word (a single number per word)
  2. Apply softmax across all scores → get attention weights (probabilities summing to 1)
  3. Use those weights to take a weighted average of the source word representations

The attention weight tells the decoder: “Pay this much attention to word i.” Softmax ensures these attention percentages always add up to 100%.

If three source words have raw attention scores [1.5, 3.2, 0.8]:

exp values:  [4.48, 24.53, 2.23]
Sum:          31.24

Attention weights:  [0.143, 0.785, 0.071]

The second word gets 78.5% of the attention. You will see this pattern throughout Papers 07 and 08.
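The three attention steps can be sketched as follows. The scores are the ones from this example; the two-dimensional source_vectors are hypothetical values invented purely to make the weighted average concrete, since real models use learned vectors with hundreds of dimensions.

```python
import math

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Step 1: one raw relevance score per source word (from the example above).
scores = [1.5, 3.2, 0.8]

# Hypothetical 2-dimensional representations of the three source words.
source_vectors = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
]

# Step 2: softmax turns the scores into attention weights that sum to 1.
weights = softmax(scores)

# Step 3: weighted average of the source vectors, one dimension at a time.
context = [
    sum(w * vec[d] for w, vec in zip(weights, source_vectors))
    for d in range(len(source_vectors[0]))
]

print("weights:", [round(w, 3) for w in weights])
print("context:", [round(c, 3) for c in context])
```

Because the second word carries most of the weight, the resulting context vector sits closest to the second source vector.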


10. Common mistakes

  • Confusing logits and probabilities. The input to softmax is raw scores (logits). The output is probabilities. Never apply softmax to probabilities — you’d be squashing numbers that are already valid.

  • Applying softmax element-wise without dividing by the sum. Some students think softmax just means “apply exp to each element.” That is only half of it. You must divide by the total. Without the division, nothing sums to 1.

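  • Numerical overflow. In code, exp(z) can overflow to infinity for large z. The standard trick is to subtract the maximum value first and compute softmax(z - max(z)). The result is mathematically identical, because the common factor exp(-max(z)) appears in both the numerator and the denominator and cancels, and it is numerically stable because the largest exponent is now 0. The built-in softmax in most frameworks already does this, but it is good to know.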

  • Softmax in multi-class vs binary. For two classes, softmax reduces to the sigmoid function applied to the difference of the two logits: softmax([a, b]) gives sigmoid(a - b) for the first class. You will sometimes see sigmoid used for binary classification and softmax for multi-class. They are related, not different ideas.
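Two of the points above, the max-subtraction trick and the link to sigmoid, can be checked with a short sketch. The names softmax_stable and sigmoid, and the logits a and b, are chosen for this illustration and are not taken from any framework.

```python
import math

def softmax_stable(logits):
    """Softmax computed on logits shifted by their maximum."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]  # largest exponent is 0
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# math.exp(1000.0) raises OverflowError in plain Python,
# but the shifted version handles the same logits without trouble.
large = softmax_stable([1000.0, 999.0, 998.0])
small = softmax_stable([2.0, 1.0, 0.0])  # same gaps, so same probabilities

# Two-class case: the first probability equals sigmoid(a - b).
a, b = 1.3, -0.4  # hypothetical logits
two_class = softmax_stable([a, b])

print([round(p, 3) for p in large])
print([round(p, 3) for p in small])
print(round(two_class[0], 6), round(sigmoid(a - b), 6))
```

The first two printed lists match because softmax depends only on the differences between logits, not on their absolute size.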


11. Try it yourself

Exercise 1: A network produces logits [1.0, 2.0, 3.0] for three classes. Compute softmax by hand.

Answer:

exp(1.0) = 2.718
exp(2.0) = 7.389
exp(3.0) = 20.086

Sum = 30.193

softmax = [2.718/30.193, 7.389/30.193, 20.086/30.193] = [0.090, 0.245, 0.665]

Check: 0.090 + 0.245 + 0.665 = 1.000 ✓


Exercise 2: What happens if all logits are equal? Say z = [2.0, 2.0, 2.0, 2.0]. What do you expect softmax to output before calculating? Then verify.

Answer:

Expectation: when all scores are equal, no word is preferred, so all probabilities should be equal → [0.25, 0.25, 0.25, 0.25].

Verification: exp(2.0) = 7.389 for all four entries.
Sum = 4 × 7.389 = 29.556
Each softmax value = 7.389 / 29.556 = 0.250

This is a uniform distribution — the model is equally uncertain about all four options. In attention, this would mean “pay equal attention to all words” — no focus at all.


Exercise 3: Attention scores for five words are [0.3, 1.9, 0.7, 1.2, 0.4]. Which word gets the most attention? Compute softmax and find the exact percentage.

Answer:

The second word (score 1.9) will dominate.

exp values: [e^0.3, e^1.9, e^0.7, e^1.2, e^0.4] = [1.350, 6.686, 2.014, 3.320, 1.492]

Sum = 14.862

softmax = [0.091, 0.450, 0.136, 0.223, 0.100]

The second word gets 45.0% of the attention. This is the sharpening effect of exp at work — the second word’s score (1.9) was only 1.6 points higher than the first (0.3), but it received nearly 5 times the attention weight (0.450 vs 0.091).

