Section 05

Worked Example: Few-Shot Sentiment Classification

Language Models are Few-Shot Learners 2020

Let’s trace through a concrete few-shot learning example step-by-step, showing how GPT-3 would process it.

The Task

Classify customer reviews into three categories: positive, negative, or neutral.

We’ll build the prompt with 3 examples (few-shot), then ask the model to classify a new review.

Step 1: Build the Prompt

You are a sentiment classification expert. Classify each review as positive, negative, or neutral.

Review: "The product arrived quickly and works perfectly."
Sentiment: positive

Review: "The item broke after one week. Very disappointing."
Sentiment: negative

Review: "The delivery took longer than expected, but the product is okay."
Sentiment: neutral

Review: "Best purchase I've made all year!"
Sentiment:
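The prompt above can also be assembled programmatically. The following is a minimal sketch, not a prescribed method: the names `INSTRUCTION`, `EXAMPLES`, and `build_prompt` are hypothetical and chosen here only for illustration.

```python
from typing import List, Tuple

INSTRUCTION = (
    "You are a sentiment classification expert. "
    "Classify each review as positive, negative, or neutral."
)

def build_prompt(examples: List[Tuple[str, str]], new_review: str) -> str:
    """Assemble the instruction, the labeled examples, and the unlabeled query."""
    blocks = [INSTRUCTION]
    for review, label in examples:
        blocks.append(f'Review: "{review}"\nSentiment: {label}')
    # The query ends with a bare "Sentiment:" so the next token is the label.
    blocks.append(f'Review: "{new_review}"\nSentiment:')
    return "\n\n".join(blocks)

EXAMPLES = [
    ("The product arrived quickly and works perfectly.", "positive"),
    ("The item broke after one week. Very disappointing.", "negative"),
    ("The delivery took longer than expected, but the product is okay.", "neutral"),
]

prompt = build_prompt(EXAMPLES, "Best purchase I've made all year!")
print(prompt)
```

Keeping the examples in a list makes it easy to vary the number of shots later without retyping the prompt.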

Step 2: Tokenization

The transformer doesn’t operate on words; it operates on tokens, which are whole words or pieces of words. The sequence below is simplified to roughly one token per word (the actual GPT-3 tokenizer uses byte-pair encoding with a 50,257-token vocabulary):

Token sequence (simplified):
[You] [are] [a] [sentiment] [...] [positive] [Review] [:]
[The] [item] [broke] [after] [one] [week] [.] [...] [Sentiment] [:]
[negative] [...] [Review] [:] [Best] [purchase] [...] [Sentiment] [:]

Index:  0    1    2      3    ...   N-2      N-1     (N-1 is the final [:]; position N is the token to be predicted)

The model’s job is to predict the token that comes after [Sentiment] [:] for the new review.
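To make the idea of a token sequence concrete, here is a toy tokenizer. It is a sketch under a loud assumption: it splits on words and punctuation with a regular expression, which is NOT how GPT-3’s byte-pair encoding works. The function name `toy_tokenize` is hypothetical.

```python
import re
from typing import List

def toy_tokenize(text: str) -> List[str]:
    """Split text into word-like pieces and single punctuation marks.

    This is NOT GPT-3's byte-pair encoding; it only mimics the simplified
    one-token-per-word view used in this walkthrough.
    """
    return re.findall(r"\w+|[^\w\s]", text)

tokens = toy_tokenize('Review: "Best purchase I\'ve made all year!"\nSentiment:')
for index, token in enumerate(tokens):
    print(index, token)
```

The last two tokens are `Sentiment` and `:`, so the position the model must fill in comes right after them.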

Step 3: Embedding and Encoding

Each token is converted to a vector (embedding) of dimension 12,288 for GPT-3:

Token "positive" → [0.123, -0.456, 0.789, ..., -0.234]  (12,288 numbers)
Token "negative" → [0.098, 0.341, -0.567, ..., 0.111]   (12,288 numbers)
Token "neutral"  → [0.211, 0.019, 0.876, ..., -0.445]   (12,288 numbers)

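These embeddings are learned during pre-training and capture semantic meaning. GPT-3 also adds a learned position embedding to each token embedding, so the model knows where each token sits in the sequence.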
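Mechanically, the embedding step is a table lookup. The sketch below assumes a tiny vocabulary and a toy dimension of 8 instead of 12,288, and it fills the table with random numbers, so unlike real learned embeddings these vectors carry no meaning. The names `make_embedding_table` and `embed` are hypothetical.

```python
import random
from typing import Dict, List

D_MODEL = 8  # toy size (assumption); the text gives 12,288 for GPT-3

def make_embedding_table(vocab: List[str], d_model: int, seed: int = 0) -> Dict[str, List[float]]:
    """Random vectors standing in for embeddings learned in pre-training."""
    rng = random.Random(seed)
    return {tok: [rng.uniform(-1.0, 1.0) for _ in range(d_model)] for tok in vocab}

vocab = ["positive", "negative", "neutral", "Sentiment", ":"]
table = make_embedding_table(vocab, D_MODEL)

def embed(tokens: List[str]) -> List[List[float]]:
    """Map each token to its vector by table lookup."""
    return [table[tok] for tok in tokens]

vectors = embed(["Sentiment", ":"])
print(len(vectors), "tokens, each with", len(vectors[0]), "numbers")
```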

Step 4: Transformer Forward Pass (Simplified)

The transformer has 96 layers. Let’s trace through one layer conceptually:

Layer 1 Input: Token embeddings for the entire prompt (including the new review).

Attention (simplified): When the model processes the final [:] token (the one after the new review’s [Sentiment]), each attention head can look at all previous tokens. The head roles below are illustrative; heads in a trained model are not assigned such clean jobs:

  1. Head 1: Might attend to the previous sentiment examples.

    • Computes similarity between “Best purchase I’ve made all year” and the examples.
    • “Best purchase” is similar to “quickly and works perfectly” (both positive).
    • Attention weight to the example’s “positive” label: high.
  2. Head 2: Might attend to task-related tokens.

    • Focuses on the pattern: [Review] [:] [words…] [Sentiment] [:] [label]
    • Recognizes the format.
  3. Head 3: Might attend to semantic similarities.

    • Identifies that “Best” and “purchase” are positive sentiment indicators.
    • Attention weight to the positive example: higher.

Feedforward: After attention, a feedforward network processes the output, adding nonlinearities and refinement.

Output of Layer 1: Updated embedding for the new review’s [Sentiment] position, conditioned on everything in the prompt.

Layers 2–96: This process repeats. Each layer refines the representation. By layer 96, the model has:

  • Recognized the task (sentiment classification)
  • Found similar examples (positive indicator words)
  • Prepared to output a sentiment label
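The attention computation described in this step can be sketched for a single head and a single query position. This is a simplified illustration, not GPT-3’s implementation: it omits the learned query, key, and value projections, uses hand-picked 2-dimensional vectors, and the names `dot`, `softmax`, and `attend` are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]

def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))

def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query: Vector, keys: List[Vector], values: List[Vector]) -> Tuple[List[float], Vector]:
    """Scaled dot-product attention for a single query position."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

# Hand-picked vectors for the three example label positions (assumption).
keys = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]]  # positive, negative, neutral
values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
query = [2.0, 0.5]  # stands in for the final [:] position of the new review

weights, output = attend(query, keys, values)
print([round(w, 2) for w in weights])
```

Because the query points in nearly the same direction as the first key, the first weight is the largest, which mirrors the "attention weight to the positive example: high" intuition above.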

Step 5: Output Projection and Softmax

After all 96 layers, the model projects the final hidden state to a probability distribution over all 50,257 tokens:

Hidden state after layer 96: h_final = [0.234, -0.567, ..., 0.123]  (12,288 numbers)

Linear projection: logits = W * h_final  (W is 50,257 × 12,288, so logits has 50,257 entries)

Softmax over all tokens:
P(token) = exp(logit_token) / Σ exp(logit_k) for all tokens k

For the sentiment labels, the softmax might produce:

P(positive) = 0.75   (from logit = 2.5)
P(negative) = 0.15   (from logit = 0.8)
P(neutral)  = 0.10   (from logit = 0.5)
[other tokens] ≈ 0.00

The model assigns about 75% probability to “positive” as the next token. (The probabilities above are illustrative and rounded.)
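The softmax formula can be checked by hand on the three label logits. The sketch below makes one simplifying assumption: it normalizes over only these three tokens, whereas the real softmax runs over all 50,257. Under that assumption the exact values come out close to, but not identical with, the rounded figures quoted in the text.

```python
import math
from typing import Dict

def softmax(logits: Dict[str, float]) -> Dict[str, float]:
    """P(token) = exp(logit_token) / sum of exp(logit_k) over all tokens k."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - peak) for tok, v in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

logits = {"positive": 2.5, "negative": 0.8, "neutral": 0.5}
probs = softmax(logits)
for token, p in probs.items():
    print(f"P({token}) = {p:.2f}")
# prints:
# P(positive) = 0.76
# P(negative) = 0.14
# P(neutral) = 0.10
```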

Step 6: Generation

The model then picks the next token from this distribution. Two common strategies:

Greedy: Pick the highest probability token.

argmax P(token) = "positive" (P = 0.75)

Sampling: Sample from the distribution (sometimes you want randomness).

Random sample from distribution:
Random value u = 0.42 → u falls in [0, 0.75), the cumulative interval for "positive" → sample "positive"

For this review, the output is: positive.
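Both strategies fit in a few lines. This sketch works on the three illustrative label probabilities only; the function names `greedy` and `sample` are hypothetical, and `sample` takes the random value `u` as an argument so the behavior is easy to inspect.

```python
import random
from typing import Dict

probs = {"positive": 0.75, "negative": 0.15, "neutral": 0.10}

def greedy(probs: Dict[str, float]) -> str:
    """Pick the single most probable token."""
    return max(probs, key=probs.get)

def sample(probs: Dict[str, float], u: float) -> str:
    """Return the first token whose cumulative probability exceeds u."""
    cumulative = 0.0
    for token, p in probs.items():
        cumulative += p
        if u < cumulative:
            return token
    return token  # guard against floating-point rounding when u is near 1

print(greedy(probs))                    # positive
print(sample(probs, 0.42))              # positive: 0.42 lies in [0, 0.75)
print(sample(probs, random.random()))   # varies from run to run
```

With these probabilities, a value of `u` between 0.75 and 0.90 selects "negative", and anything from 0.90 upward selects "neutral".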

Step 7: Verify with Post-Processing

The model generates only the label token. Appended to the prompt, the final entry now reads:

Review: "Best purchase I've made all year!"
Sentiment: positive

A simple check: Does the output match the format? Yes. Is the output token in the valid set {positive, negative, neutral}? Yes.
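The check can be automated. Below is a minimal sketch, assuming the model’s completion arrives as a plain string; `VALID_LABELS` and `parse_label` are hypothetical names, and the rule of taking only the first word is a design choice made here, not something the source prescribes.

```python
from typing import Optional

VALID_LABELS = {"positive", "negative", "neutral"}

def parse_label(completion: str) -> Optional[str]:
    """Return the label if the completion's first word is valid, else None."""
    words = completion.strip().split()
    if not words:
        return None
    # Drop surrounding punctuation and normalize case before comparing.
    candidate = words[0].strip('.,!?"\'').lower()
    return candidate if candidate in VALID_LABELS else None

print(parse_label(" positive"))  # positive
print(parse_label(" amazing"))   # None
```

Returning `None` instead of raising lets the caller decide whether to retry, fall back to a default, or log the failure.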

Multiple Examples: Why Few-Shot Helps

Now let’s see why 3 examples are better than 0.

Zero-shot (no examples):

Classify the sentiment of the review:
Review: "Best purchase I've made all year!"
Sentiment:

The model must infer the task from general knowledge. It might output:

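  • “positive” (correct, because it recognizes the enthusiastic language)
  • “good” (wrong: a plausible word, but not one of the three labels)
  • “amazing” (wrong: it continues the text instead of choosing a label)

Accuracy: ~70% (an illustrative figure, not a measured result).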

One-shot (one example):

Review: "Terrible product."
Sentiment: negative

Review: "Best purchase I've made all year!"
Sentiment:

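The model now sees the format (Review → Sentiment label) and can tell that the answer should be a single label word, so it puts more probability, say 80%, on “positive”.

Accuracy: ~82% (an illustrative figure, not a measured result).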

Few-shot (three examples):

Review: "Works perfectly."
Sentiment: positive

Review: "Broke after one week."
Sentiment: negative

Review: "Delivery was okay, product is fine."
Sentiment: neutral

Review: "Best purchase I've made all year!"
Sentiment:

The model has:

  • Task format (clearly defined)
  • Multiple positive/negative/neutral examples (patterns)
  • Confidence: it recognizes the new review matches the positive examples

Accuracy: ~92% (an illustrative figure, not a measured result).
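Accuracy figures like these would come from running a classifier over held-out labeled reviews and counting matches. The sketch below shows only that bookkeeping. Everything in it is an assumption for illustration: `keyword_stub` is a hypothetical stand-in for a model call and is not a language model, and the fourth review in `held_out` is made up for the example.

```python
from typing import Callable, List, Tuple

def accuracy(classify: Callable[[str], str], labeled: List[Tuple[str, str]]) -> float:
    """Fraction of reviews whose predicted label equals the gold label."""
    correct = sum(1 for review, gold in labeled if classify(review) == gold)
    return correct / len(labeled)

def keyword_stub(review: str) -> str:
    """Hypothetical stand-in for a model call; NOT a language model."""
    text = review.lower()
    if "best" in text or "perfect" in text:
        return "positive"
    if "broke" in text or "terrible" in text:
        return "negative"
    return "neutral"

held_out = [
    ("Best purchase I've made all year!", "positive"),
    ("Terrible product.", "negative"),
    ("Delivery was okay, product is fine.", "neutral"),
    ("Stopped charging after a month.", "negative"),  # made-up example
]

print(f"accuracy = {accuracy(keyword_stub, held_out):.2f}")  # accuracy = 0.75
```

To compare zero-shot, one-shot, and few-shot prompting, the same `accuracy` function would be run once per prompt style with a real model call in place of the stub.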

Why Attention Matters

In step 4, the model’s attention mechanism was crucial. Without it:

  • How would the model compare “Best purchase” to “Works perfectly”?
  • How would it associate the new review with the positive example?

Attention weights let different parts of the input inform the decision. In the Step 4 illustration, one head tracks word similarity, another the prompt format, and another sentiment cues.

Edge Case: Ambiguous Review

What if the review were:

Review: "It works, but it is expensive."
Sentiment:

Now the model is torn:

  • “Works” is positive (matches example 1: “works perfectly”)
  • “Expensive” is negative (context suggests complaint)

The model might output:

P(positive) = 0.40
P(negative) = 0.45
P(neutral)  = 0.15

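Its top choice is negative, but only at 45% probability. With greedy decoding it outputs “negative”; with sampling, it would flip between negative and positive from run to run, and occasionally land on neutral.

This is realistic: the model’s probabilities reflect the ambiguity of the review itself.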
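One way to act on this ambiguity is to flag predictions whose top two probabilities are close. This is a sketch of one possible heuristic, not something the source prescribes: the function names and the 0.10 threshold are assumptions chosen for illustration.

```python
from typing import Dict

def top_two_margin(probs: Dict[str, float]) -> float:
    """Gap between the most likely and second most likely label."""
    ranked = sorted(probs.values(), reverse=True)
    return ranked[0] - ranked[1]

def is_ambiguous(probs: Dict[str, float], threshold: float = 0.10) -> bool:
    """Flag a prediction when the top two labels are nearly tied."""
    return top_two_margin(probs) < threshold

clear = {"positive": 0.75, "negative": 0.15, "neutral": 0.10}
torn = {"positive": 0.40, "negative": 0.45, "neutral": 0.15}

print(is_ambiguous(clear))  # False: margin is 0.60
print(is_ambiguous(torn))   # True: margin is about 0.05
```

Flagged reviews could be routed to a human or labeled "mixed" instead of being forced into one class.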

Key Takeaways from This Section

  • Tokenization: Text → tokens → embeddings.
  • Attention: The model attends to examples and compares the test input to them.
  • Layers: 96 layers refine the representation, building up task understanding.
  • Projection: Final hidden state → logits → softmax → probability distribution.
  • Few-shot boost: More examples tend to raise accuracy (from ~70% to ~92% in this illustrative case).
  • In-context learning: The model picks up the task from the examples in the prompt, with no fine-tuning or weight updates.
