Section 04

The Math: Language Modeling and In-Context Learning

Language Models are Few-Shot Learners 2020

GPT-3 uses the same objective function as GPT-1. The innovation is scale, not mathematics. But understanding the math clarifies how in-context learning works.

Prerequisites: Cross-Entropy Loss, Conditional Probability

The Objective: Causal Language Modeling

The model learns to predict the next token given all previous tokens. This is called causal language modeling: each position attends only to past context, never to future tokens.

Objective:
L = -1/N * Σ_i log P(u_i | u_1, u_2, ..., u_{i-1})

where:
  u_i        = the i-th token in the sequence
  N          = number of tokens being predicted (the sum runs over these)
  P(u_i | u_1,...,u_{i-1}) = probability the model assigns to token u_i 
                              given all previous tokens
  log P(...) = natural-log probability (always ≤ 0; closer to 0 = model is more confident)
  The negative sign and the average over tokens make this the cross-entropy loss

We minimize this loss: the model learns to assign high probability to the correct next token.
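
As a minimal sketch of the objective above, the following Python computes the average negative log-likelihood from raw model scores (logits). The function names log_softmax and causal_lm_loss and the toy logits are assumptions made for illustration; they do not come from the GPT-3 paper.

```python
import math

def log_softmax(logits):
    """Convert raw scores over the vocabulary into log-probabilities."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def causal_lm_loss(logits_per_step, target_ids):
    """Average of -log P(u_i | u_1, ..., u_{i-1}) over the predicted tokens.

    logits_per_step[i] holds the scores for the token at step i, assumed to
    be computed from the preceding tokens only; target_ids[i] is the
    vocabulary index of the token that actually occurred.
    """
    total = 0.0
    for logits, target in zip(logits_per_step, target_ids):
        total -= log_softmax(logits)[target]
    return total / len(target_ids)

# Toy 3-word vocabulary and two prediction steps (made-up numbers).
toy_logits = [[2.0, 1.0, 0.1], [0.5, 2.5, 0.3]]
toy_targets = [0, 1]
loss = causal_lm_loss(toy_logits, toy_targets)
print(round(loss, 3))
```

A useful sanity check: with uniform logits over V words, the loss equals log(V), which is the loss of a model that has learned nothing.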

Worked Example: Computing Cross-Entropy Loss

Sequence: “I love cats”
Tokens: [I, love, cats]

Let’s compute the loss. Assume the vocabulary has 50,000 words.

Step 1: Predict token 2 (love) from token 1 (I)

The model outputs probabilities over all 50,000 words. Let’s say:

  • P(love | I) = 0.3
  • P(dogs | I) = 0.2
  • P(hate | I) = 0.1
  • [all other words share remaining 0.4]

The correct token is “love”. Loss contribution:

L_1 = -log(0.3) = -(-1.204) = 1.204

(Higher probability → lower loss. If P(love|I) were 0.9, loss would be -log(0.9) = 0.105.)

Step 2: Predict token 3 (cats) from tokens 1–2 (I love)

The model computes:

  • P(cats | I, love) = 0.5
  • P(dogs | I, love) = 0.2
  • P(people | I, love) = 0.1
  • [remaining 0.2]

The correct token is “cats”. Loss contribution:

L_2 = -log(0.5) = 0.693

Step 3: Total Loss

Average loss over 2 tokens:

L = (L_1 + L_2) / 2 = (1.204 + 0.693) / 2 = 0.9485

The model learns by minimizing this. If it can increase P(love|I) from 0.3 to 0.8 and increase P(cats|I,love) from 0.5 to 0.9, the loss drops to:

L = (-log(0.8) - log(0.9)) / 2 = (0.223 + 0.105) / 2 = 0.164

Much lower loss = better model.
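
The arithmetic above can be checked with a few lines of Python. This is only a sketch: average_nll is a hypothetical helper name, and the probabilities are the ones assumed in the worked example. Note that the logarithm is the natural log.

```python
import math

def average_nll(probs_of_correct_tokens):
    """Mean negative natural-log probability of the correct tokens."""
    n = len(probs_of_correct_tokens)
    return -sum(math.log(p) for p in probs_of_correct_tokens) / n

before = average_nll([0.3, 0.5])  # P(love | I), P(cats | I, love)
after = average_nll([0.8, 0.9])   # the improved probabilities

print(f"before: {before:.3f}")  # about 0.949
print(f"after:  {after:.3f}")   # about 0.164
```

The small difference from 0.9485 comes from rounding each term to three decimals before averaging.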

In-Context Learning: Conditional Probability

During inference (generation), the model doesn’t change its weights. Instead, the prompt conditions the probability distribution.

For a sentiment task:

Prompt examples:
[Review: "great movie", Sentiment: positive]
[Review: "bad food", Sentiment: negative]

Task:
[Review: "nice book", Sentiment: ?]

The full input is a sequence of tokens:

[Review:] [great] [movie] [Sentiment:] [positive] [Review:] [bad] [food] [Sentiment:] [negative] [Review:] [nice] [book] [Sentiment:] [?]

The model predicts the next token after “Sentiment:”. It computes:

P(next token | all previous tokens in the prompt)

Because the previous tokens include examples, the model’s predicted distribution shifts. From the prompt alone, the model picks up:

  • From examples: sentiment tasks show review → sentiment pairs
  • Pattern inference: the prompt shows positive after “great”, negative after “bad”
  • Activation: when it sees “nice”, it favors “positive” because the pattern matches “great”

All of this happens in the forward pass (inference). No weight updates.

Formal Definition

In-context learning on a task with examples (x_1, y_1), …, (x_k, y_k) and a test input x_test:

Input sequence: [x_1, y_1, x_2, y_2, ..., x_k, y_k, x_test]

Output prediction: ŷ = arg max_y P(y | x_1, y_1, ..., x_k, y_k, x_test)

The prediction ŷ is the completion to which the model assigns the highest probability given the examples and the test input.
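
The setup above can be sketched in Python as follows. The names build_prompt, predict_label, and toy_cond_prob are hypothetical, and toy_cond_prob returns invented numbers in place of a real model's P(y | prompt). The point is only that prediction is an arg max over a conditional distribution, with no weight updates.

```python
def build_prompt(examples, x_test):
    """Concatenate k demonstrations and the test input into one sequence."""
    lines = [f'Review: "{x}", Sentiment: {y}' for x, y in examples]
    lines.append(f'Review: "{x_test}", Sentiment:')
    return "\n".join(lines)

def predict_label(prompt, candidates, cond_prob):
    """Return the candidate y that maximizes cond_prob(prompt, y)."""
    return max(candidates, key=lambda y: cond_prob(prompt, y))

def toy_cond_prob(prompt, label):
    """Hypothetical stand-in for a language model's P(label | prompt).

    The numbers are invented for illustration; a real model would compute
    them with a forward pass over the prompt tokens.
    """
    made_up = {"positive": 0.7, "negative": 0.2}
    return made_up.get(label, 0.0)

examples = [("great movie", "positive"), ("bad food", "negative")]
prompt = build_prompt(examples, "nice book")
prediction = predict_label(prompt, ["positive", "negative"], toy_cond_prob)
print(prompt)
print(prediction)
```

Swapping in different demonstrations changes only the prompt string, which is why the same frozen model can be pointed at a new task.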

The model is trained on the objective:

L = -1/N * Σ_i log P(u_i | u_1, ..., u_{i-1})

applied to all training sequences. So it’s trained to predict the next token given context. During in-context learning, the “context” includes prompt examples.

Why Scale Enables In-Context Learning

Smaller models (GPT-1, 117M) can do in-context learning only weakly because they have limited capacity to store knowledge. When they must hold the task pattern and generate the answer at the same time, they often fail.

Larger models (GPT-3, 175B) can hold the task pattern in attention and in the hidden states while generating the answer. They have enough capacity:

  • To store patterns about what a sentiment classifier should do
  • To recognize the task from examples
  • To apply the pattern to new inputs

Mathematically, this isn’t a different mechanism. It’s the same transformer forward pass. But the capacity allows the pattern-matching to work.

The Attention Mechanism’s Role

The transformer’s self-attention layer is key to in-context learning:

Query:   q_i = W_q * h_i        (current token representation)
Key:     k_j = W_k * h_j        (each visible token j ≤ i)
Value:   v_j = W_v * h_j        (each visible token j ≤ i)

Attention weights: α_ij = softmax over j of ( (q_i · k_j) / √d_k )

Output: Σ_j α_ij * v_j          (weighted sum of values)

When the model attends to the prompt examples (high α for example tokens), it learns from them. When it attends to the test input, it applies that learning.
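
To make the attention equations concrete, here is a minimal single-query sketch in plain Python. The function names and the toy 2-dimensional vectors are assumptions for illustration; a real model would obtain them by multiplying hidden states by the learned W_q, W_k, W_v and would run many heads in parallel.

```python
import math

def softmax(scores):
    """Normalize scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for one query.

    In a causal model, `keys` and `values` come only from positions the
    query is allowed to see (its own position and earlier ones).
    """
    d_k = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
              for key in keys]
    weights = softmax(scores)
    dim_v = len(values[0])
    output = [sum(w * v[c] for w, v in zip(weights, values))
              for c in range(dim_v)]
    return weights, output

# Toy vectors (d_k = 2), made up for illustration.
q = [1.0, 0.0]
K = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
V = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
weights, out = attend(q, K, V)
print([round(w, 3) for w in weights], [round(o, 3) for o in out])
```

The key most aligned with the query gets the largest weight, so its value dominates the output.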

The 96 attention heads in each of GPT-3’s 96 layers allow different parts of the model to attend to different aspects simultaneously: one head might focus on the task format, another on semantic similarity between examples and the test input.

Worked Example: Attention in In-Context Learning

Consider the input sequence (simplified, using position-based indexing):

Position 0: "Review:"
Position 1: "great"
Position 2: "movie"
Position 3: "Sentiment:"
Position 4: "positive"
Position 5: "Review:"
Position 6: "bad"
Position 7: "food"
Position 8: "Sentiment:"
Position 9: "negative"
Position 10: "Review:"
Position 11: "nice"
Position 12: "book"
Position 13: "Sentiment:"
Position 14: ?

When the model generates the token at position 14, it:

  1. Computes attention weights over positions 0–13 (all previous tokens)
  2. Might attend heavily to position 4 (positive) and position 9 (negative) because they’re example sentiments
  3. May relate “nice book” (positions 11–12) to “great movie” (positions 1–2) because their representations are similar
  4. May carry information from position 4 (positive) forward to predict “positive” at position 14

This is all done in the forward pass, via attention. No weight updates.
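
The following toy sketch illustrates step 2 for a single hypothetical attention head. The raw scores are invented for illustration and are not measured from GPT-3. It only shows how softmax turns scores over positions 0–13 into a weighting that favors the example labels.

```python
import math

tokens = ["Review:", "great", "movie", "Sentiment:", "positive",
          "Review:", "bad", "food", "Sentiment:", "negative",
          "Review:", "nice", "book", "Sentiment:"]  # positions 0-13

# Invented raw scores for ONE hypothetical head (not real GPT-3 values).
scores = [0.0] * len(tokens)
scores[4] = 3.0  # "positive"
scores[9] = 2.0  # "negative"
scores[1] = 1.0  # "great"

def softmax(xs):
    """Normalize scores into attention weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

weights = softmax(scores)

# The three positions this toy head attends to most.
top = sorted(range(len(tokens)), key=lambda j: weights[j], reverse=True)[:3]
for j in top:
    print(j, tokens[j], round(weights[j], 3))
```

Only positions 0–13 appear in the weighting: the token at position 14 does not exist yet, which is exactly what the causal constraint requires.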

Key Equations Summary

Causal Language Model Loss:
L = -1/N * Σ log P(u_i | u_1, ..., u_{i-1})

In-Context Learning Setup:
Input: [Example tokens] + [Test input tokens]
Output: P(y_test | example tokens, test input tokens)

Attention:
Attention(Q, K, V) = softmax((Q * K^T) / √d_k) * V

Full Transformer:
output = Attention(input, input, input) + input  [+ layer norm]
output = FFN(output) + output                     [+ layer norm]
(repeat 96 times for GPT-3)

No new equations compared to GPT-1. The innovation is in how scale enables these mechanisms to work powerfully.


Key Takeaways from This Section

  • Objective: Minimize cross-entropy loss on causal language modeling.
  • In-context learning: Prompt examples condition the probability distribution; the model learns via attention in the forward pass.
  • No fine-tuning: All learning happens through the input prompt, not weight updates.
  • Attention is the mechanism: Different heads attend to different parts of the prompt examples and test input.
  • Scale enables capacity: 175B parameters allow the model to hold both task patterns and generate answers.

Next: Section 05: Worked Example