Probability Distributions

Intermediate
Before this tutorial: /math-tutorials/probability/probability-basics

1. What is this and why do we care?

A language model does not output a single answer. It outputs a distribution — a complete menu of possibilities, each with a probability attached.

Understanding what a distribution is, and which shapes distributions can take, is essential for reading any paper from Paper 07 onwards. When researchers say “the model predicts a categorical distribution over the next token” or “sample from the output distribution,” these are not metaphors. They are precise mathematical statements about specific shapes of probability.


2. Prerequisites

You need Probability Basics. Everything builds on that. You also need to know what the softmax function does — read Softmax Function if you have not.


3. The intuition — before any symbols

Think of a distribution as a scoreboard of likelihoods.

At your local cricket ground, if five teams play a tournament, you might guess their chances of winning before the season:

Team A (reigning champion): 40%
Team B (strong new team):   25%
Team C (unpredictable):     20%
Team D (average):           10%
Team E (newly promoted):     5%

This is a probability distribution over five outcomes. All probabilities sum to 100% (= 1.0). The shape — how peaked or flat the distribution is — tells you how confident you are. A flat distribution (all 20%) means “I have no idea.” A sharp, peaked distribution (one team at 95%) means “I am almost certain.”
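As a minimal sketch, the scoreboard above can be written as a Python dictionary and checked for validity. The names `team_probs` and `is_valid_distribution` are illustrative and do not come from any library.

```python
import math

# Pre-season guesses from the example above (team name -> probability).
team_probs = {
    "Team A": 0.40,
    "Team B": 0.25,
    "Team C": 0.20,
    "Team D": 0.10,
    "Team E": 0.05,
}

def is_valid_distribution(probs, tol=1e-9):
    """Return True if every probability is non-negative and they sum to 1."""
    values = list(probs.values())
    non_negative = all(p >= 0 for p in values)
    sums_to_one = math.isclose(sum(values), 1.0, abs_tol=tol)
    return non_negative and sums_to_one

# The most likely outcome is the key with the largest probability.
favourite = max(team_probs, key=team_probs.get)

print(is_valid_distribution(team_probs))
print(favourite)
```

The check uses a small tolerance rather than exact equality because floating-point sums can be off by a tiny rounding error.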

Much of what an AI model does comes down to picking from some distribution. Training a neural network largely means shaping its distributions so that they are peaked in the right places.


4. Discrete distributions

A discrete distribution applies when outcomes can be counted: words in a vocabulary, categories, labels. The probabilities are defined on a list.

4a. The Bernoulli distribution — exactly two outcomes

The simplest distribution. One parameter: p (the probability of “success”).

P(outcome = 1) = p
P(outcome = 0) = 1 − p

Example: Does the next word start with a capital letter?

p = 0.12   (12% of words in a text start with a capital)

P(yes) = 0.12
P(no)  = 0.88

Check: 0.12 + 0.88 = 1.00 ✓

In AI: Bernoulli distributions appear in binary classification — spam or not spam, toxic or not toxic, grammatical or not. The output neuron of a binary classifier passes through a sigmoid function (which you can think of as softmax for two classes) and produces a Bernoulli distribution.
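The step from a raw score to a Bernoulli distribution can be sketched as follows. The logit value of -2.0 is a hypothetical classifier output chosen so that p lands near the 0.12 used above, and the helper names are illustrative.

```python
import math
import random

def sigmoid(logit):
    """Squash a real-valued score into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))

def sample_bernoulli(p, rng):
    """Return 1 with probability p and 0 with probability 1 - p."""
    return 1 if rng.random() < p else 0

# Hypothetical raw score from a binary classifier's output neuron.
logit = -2.0
p = sigmoid(logit)  # probability of outcome 1, roughly 0.12

rng = random.Random(0)  # fixed seed so the run is repeatable
draws = [sample_bernoulli(p, rng) for _ in range(10_000)]

# The fraction of 1s should be close to p.
print(round(p, 3), sum(draws) / len(draws))
```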


4b. The Categorical distribution — multiple outcomes

This is the most important distribution in language modelling. It generalises Bernoulli from two outcomes to K outcomes (K ≥ 2; with K = 2 it reduces to Bernoulli).

Parameters: [p₁, p₂, …, p_K] where every pᵢ ≥ 0 and Σ pᵢ = 1.

Example: Predicting the next word from a tiny vocabulary of 6 words.

Vocabulary:   "chai"  "piyo"  "kal"  "aaj"  "nahi"  "haan"
Probability:   0.40    0.25   0.15   0.12    0.05    0.03

Check: 0.40 + 0.25 + 0.15 + 0.12 + 0.05 + 0.03 = 1.00 ✓

This is a categorical distribution over 6 words. In a real language model, the vocabulary has roughly 30,000 to 100,000 entries (usually tokens, meaning words or word pieces), but the idea is identical: a huge list of probabilities summing to 1.

How does a neural network produce a categorical distribution?

The network outputs raw scores (logits), one per word in the vocabulary. The softmax function converts these logits into probabilities that sum to 1. The output of softmax is exactly a categorical distribution. This is why softmax and categorical distributions always appear together.

Worked example:

Three logits: z = [2.1, 0.8, 1.5] for words [“tea”, “milk”, “water”].

exp(2.1) = 8.166
exp(0.8) = 2.226
exp(1.5) = 4.482

Sum = 14.874

p₁ = 8.166 / 14.874 = 0.549  ("tea")
p₂ = 2.226 / 14.874 = 0.150  ("milk")
p₃ = 4.482 / 14.874 = 0.301  ("water")

Check: 0.549 + 0.150 + 0.301 = 1.000 ✓

Categorical distribution over 3 words: [0.549, 0.150, 0.301]. To generate a word, you sample from this distribution — “tea” gets picked about 55% of the time.
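The worked example above can be reproduced with a few lines of Python. This is a sketch using only the standard library; the function name `softmax` is our own, and subtracting the maximum logit is a common numerical-stability habit rather than something the example requires.

```python
import math
import random

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtracting the maximum does not change the result but avoids overflow.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

words = ["tea", "milk", "water"]
logits = [2.1, 0.8, 1.5]
probs = softmax(logits)
print([round(p, 3) for p in probs])  # [0.549, 0.15, 0.301]

# Sampling: draw many words and count how often "tea" comes up.
rng = random.Random(0)  # fixed seed so the run is repeatable
samples = rng.choices(words, weights=probs, k=10_000)
print(samples.count("tea") / len(samples))
```

The sampled fraction for "tea" should sit close to 0.55, matching the probability computed by hand.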


4c. The Uniform distribution — maximum uncertainty

When all outcomes are equally likely, the distribution is uniform.

For K outcomes:

P(each outcome) = 1 / K

Example: Picking a random day of the week.

P(Monday) = P(Tuesday) = ... = P(Sunday) = 1/7 ≈ 0.143

In AI, a uniform output distribution signals that the model has learned nothing useful about the context: if a language model assigns equal probability to every word, it has no idea what comes next. During training, models start near-uniform and gradually sharpen their distributions toward correct answers.

The “entropy” (a measure of uncertainty) of a distribution is highest when it is uniform and lowest when it is a spike on one outcome. You will encounter this in the Information Theory tutorials.
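As a rough preview, the claim about entropy can be checked numerically. The sketch below assumes the standard Shannon entropy measured in nats (natural logarithm); the three example distributions are made up for illustration.

```python
import math

def entropy(probs):
    """Shannon entropy in nats; outcomes with p = 0 contribute nothing."""
    return sum(-p * math.log(p) for p in probs if p > 0)

uniform = [1 / 7] * 7                                # days-of-the-week example
peaked = [0.70, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05]  # one clear favourite
spike = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]          # all mass on one outcome

for name, dist in [("uniform", uniform), ("peaked", peaked), ("spike", spike)]:
    print(name, round(entropy(dist), 3))
```

The uniform case gives the largest value (the log of 7), the spike gives zero, and the peaked case falls in between.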


5. Continuous distributions

A continuous distribution applies when outcomes are not countable but form a continuous range: height, temperature, a score between −∞ and +∞.

5a. The Normal (Gaussian) distribution

The most important continuous distribution in mathematics. Its shape is the classic bell curve.

Two parameters:

  • μ (mu, pronounced “myoo”): the mean, where the bell is centred
  • σ (sigma): the standard deviation, how wide the bell is

The bell curve formula:

p(x) = (1 / (σ√(2π))) × exp(−(x − μ)² / (2σ²))

You do not need to memorise this formula. What you need to know is what it means:

  • Values near μ are most likely
  • Values far from μ (many standard deviations away) are rare
  • The distribution is symmetric around μ
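The formula and the three properties above can be checked with a short sketch. The function name `gaussian_pdf` is our own, and the standard normal (μ = 0, σ = 1) is used only as a convenient illustration.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of a Gaussian with mean mu and standard deviation sigma at x."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coeff * math.exp(exponent)

mu, sigma = 0.0, 1.0  # illustrative choice: the standard normal

# The density is highest at the mean and falls off with distance from it.
for x in (0.0, 1.0, 2.0, 3.0):
    print(x, round(gaussian_pdf(x, mu, sigma), 4))

# Symmetry: equal distances on either side of mu give equal density.
print(gaussian_pdf(mu - 1.5, mu, sigma) == gaussian_pdf(mu + 1.5, mu, sigma))
```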

Example: Heights of students in a class.

μ = 165 cm (average height)
σ = 8 cm (standard deviation)

P(student is between 157 and 173 cm) ≈ 68%   (within 1 standard deviation)
P(student is between 149 and 181 cm) ≈ 95%   (within 2 standard deviations)
P(student is below 141 cm or above 189 cm) ≈ 0.3%  (more than 3 SD away — rare)

This “68-95-99.7 rule” works for any Gaussian:

Within 1σ of the mean: ~68% of the data
Within 2σ of the mean: ~95% of the data
Within 3σ of the mean: ~99.7% of the data
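The three percentages are not arbitrary. For any Gaussian, the probability of landing within k standard deviations of the mean equals erf(k / √2), where erf is the error function. A minimal sketch using Python's built-in `math.erf` (the helper name `prob_within` is ours):

```python
import math

def prob_within(k):
    """P(|X - mu| < k * sigma) for any Gaussian, via the error function."""
    return math.erf(k / math.sqrt(2.0))

for k in (1, 2, 3):
    print(k, round(prob_within(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```

Because the result depends only on k, the same numbers apply to heights, temperatures, or any other Gaussian quantity.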

In AI: The Gaussian appears in:

  • Weight initialisation (neural network weights are often initialised from a zero-mean Gaussian with small variance, such as N(0, 0.01))
  • Variational autoencoders (models a distribution in latent space)
  • Scaling laws (noise in training loss is often treated as roughly Gaussian)
  • Diffusion models (each step adds Gaussian noise)

5b. Comparing distributions: sharp vs flat

The key insight about distributions in AI is their sharpness.

Imagine two distributions over temperatures for a summer day in Delhi:

Distribution A: μ = 40°C, σ = 1°C   — very sharp, almost certainly 40°C ± a little
Distribution B: μ = 40°C, σ = 10°C  — flat, could easily be 30°C or 50°C

In language models:

  • A sharp (low-entropy) distribution = confident model; lowering the sampling temperature sharpens a distribution further
  • A flat (high-entropy) distribution = uncertain model; raising the sampling temperature flattens a distribution

Training pushes the model to have sharp distributions at the right answers. If the correct next word is “chai,” training pushes its probability up and all others down — sharpening the distribution around the correct outcome.
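The link between temperature and sharpness can be sketched by dividing the logits by a temperature before applying softmax. This assumes the common convention p = softmax(z / T); the function names are illustrative, and the logits are the tea/milk/water values from section 4b.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature; temperature must be positive."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; outcomes with p = 0 contribute nothing."""
    return sum(-p * math.log(p) for p in probs if p > 0)

logits = [2.1, 0.8, 1.5]  # "tea", "milk", "water"

for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs], round(entropy(probs), 3))
```

Lower temperatures push more probability onto "tea" and reduce the entropy; higher temperatures spread the probability out and raise it.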


6. How distributions connect to training

Every training step of a neural network compares two distributions:

  1. The model’s predicted distribution — what the model thinks the next word should be (output of softmax)
  2. The true distribution — what the next word actually is (a spike at the correct word, zero everywhere else)

The gap between these two distributions is measured by cross-entropy loss (covered in its own tutorial). The training process minimises this gap.

In symbols:

True distribution:
  P(word = "chai") = 1.0
  P(word = anything else) = 0.0

Model's predicted distribution (before training):
  P(word = "chai")  = 0.15
  P(word = "piyo")  = 0.30
  P(word = "nahi")  = 0.55
  ...

Cross-entropy loss = −log(0.15) = 1.897  (natural log; high loss, the model is wrong)

After training:
  P(word = "chai")  = 0.82
  Cross-entropy loss = −log(0.82) = 0.198  (low loss, the model is improving)

The model’s distribution is gradually shaped to match the true distribution. That is what training is.
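The two loss values can be reproduced with a short sketch. It assumes the usual definition of cross-entropy, −Σ q(x) log p(x) with the natural logarithm, where q is the true distribution and p is the prediction. In the "after" list only the 0.82 comes from the text; the other two entries are made-up values so that the list sums to 1.

```python
import math

def cross_entropy(true_dist, predicted):
    """Cross-entropy in nats between a true distribution and a prediction."""
    # Outcomes with true probability 0 contribute nothing to the sum.
    return sum(-q * math.log(p) for q, p in zip(true_dist, predicted) if q > 0)

# Order of outcomes: "chai", "piyo", "nahi"
true_dist = [1.0, 0.0, 0.0]   # spike at the correct word
before = [0.15, 0.30, 0.55]   # prediction before training
after = [0.82, 0.10, 0.08]    # 0.82 is from the text, the rest are hypothetical

print(round(cross_entropy(true_dist, before), 3))  # 1.897
print(round(cross_entropy(true_dist, after), 3))   # 0.198
```

With a spike as the true distribution, the sum collapses to a single term: minus the log of the probability given to the correct word.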


7. Where distributions appear in the papers

Papers 07–08 (Attention, Transformer): The decoder outputs a categorical distribution over the vocabulary at each decoding step. Attention weights themselves are a categorical distribution over source positions.

Paper 10 (GPT-1): Autoregressive language modelling — at every position in the text, the model predicts a categorical distribution over the next word.

Paper 13 (Scaling Laws): Analyses how the distribution of training losses changes with scale. Uses concepts from statistics about how distributions scale with data size.

Paper 15 (RLHF): The policy (the language model) is treated as a distribution over responses. Reinforcement learning steers this distribution toward high-reward responses.

Paper 22 (Claude Model Card): Safety evaluations measure the probability mass the model’s distribution places on harmful outputs.


8. Common mistakes

  • Probability vs probability density. For continuous distributions, p(x) is a probability density, not a probability. The probability of a specific exact value (like exactly 40.000000°C) is technically zero. Probability is defined over intervals: P(39°C < x < 41°C). For discrete distributions, P(x = “chai”) is a real probability. This distinction matters in the maths of Papers 13, 15, and 22.

  • Mixing up mean and mode. For a Gaussian, the mean (μ) and the mode (most common value) are the same. But for a skewed distribution, they differ. In AI, models sometimes output the mean and sometimes the mode of their predicted distribution — these can be very different outputs.

  • Forgetting that distributions must sum (discrete) or integrate (continuous) to 1. If you are checking whether a list of probabilities is a valid distribution, always sum them. If the sum is not 1.0, allowing for small rounding error, something has gone wrong, either in your computation or in the model.
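The first mistake is easier to see with numbers. In the sketch below, a very narrow Gaussian has a density greater than 1 at its mean, which would be impossible for a probability, while the probability of an interval stays between 0 and 1. The values μ = 40 and σ = 0.1 are chosen purely for illustration, and the helper names are our own.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of a Gaussian with mean mu and standard deviation sigma at x."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

def gaussian_cdf(x, mu, sigma):
    """P(X <= x) for the same Gaussian, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

mu, sigma = 40.0, 0.1  # a very narrow bell, illustrative values

# A density is not a probability: here it is about 3.99.
density_at_mean = gaussian_pdf(mu, mu, sigma)

# Probability belongs to intervals: P(39.9 < x < 40.1), about 0.68.
prob_interval = gaussian_cdf(40.1, mu, sigma) - gaussian_cdf(39.9, mu, sigma)

print(round(density_at_mean, 2), round(prob_interval, 2))
```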


9. Try it yourself

Exercise 1: A model predicts three words with logits [3.0, 1.0, 2.0]. Compute the categorical distribution and identify which word has the highest probability.

Answer:

exp(3.0) = 20.086
exp(1.0) = 2.718
exp(2.0) = 7.389

Sum = 30.193

p₁ = 20.086 / 30.193 = 0.665  ← highest
p₂ = 2.718 / 30.193 = 0.090
p₃ = 7.389 / 30.193 = 0.245

Check: 0.665 + 0.090 + 0.245 = 1.000 ✓

Word 1 has the highest probability (66.5%).


Exercise 2: A model has μ = 170 cm and σ = 6 cm for height prediction. Using the 68-95-99.7 rule: (a) What range contains 95% of predicted heights? (b) What is the probability that the predicted height is above 182 cm?

Answer:

(a) 95% lies within 2 standard deviations of the mean.
2σ = 2 × 6 = 12 cm
Range: 170 − 12 = 158 cm to 170 + 12 = 182 cm

(b) P(height > 182 cm) = P(height > μ + 2σ)
The 95% within 2σ means 5% is outside 2σ.
By symmetry, 2.5% is above 182 cm and 2.5% is below 158 cm.
P(height > 182 cm) ≈ 2.5% = 0.025


Exercise 3: A language model is equally uncertain about 4 words: P = [0.25, 0.25, 0.25, 0.25]. After one training step on the correct answer (“chai” = word 1), the probabilities update to [0.55, 0.20, 0.15, 0.10]. Has the distribution become sharper or flatter? What does the cross-entropy loss tell us, qualitatively, about before vs after?

Answer:

Sharper. The distribution now concentrates more mass on word 1 (55% vs 25% before).

Cross-entropy loss before = −log(0.25) = −log(1/4) = log(4) ≈ 1.386 (high)

Cross-entropy loss after = −log(0.55) ≈ 0.598 (lower — model improved)

The lower loss confirms the distribution has sharpened in the right direction. Training aims to push this further — ideally toward P = [1.0, 0.0, 0.0, 0.0] where loss = −log(1.0) = 0.

