Section 03

The Idea: In-Context Learning

Language Models are Few-Shot Learners 2020

The core innovation of GPT-3 is in-context learning (ICL). The model learns from examples in the input prompt, without updating any weights.

What is In-Context Learning?

Instead of fine-tuning, you provide the task to the model in the prompt:

Example 1 (sentiment):
Review: "The movie was absolutely terrible."
Sentiment: negative

Example 2 (sentiment):
Review: "I loved every minute!"
Sentiment: positive

Example 3 (sentiment):
Review: "The ending was confusing."
Sentiment: ?

The model reads examples 1 and 2, infers the pattern (“the task is to classify reviews”), and generates the answer for example 3: negative.

No fine-tuning. No large labeled dataset. No retraining. Just put a few labeled examples in the prompt.

This is few-shot learning: learning from a few examples.

Zero-Shot, One-Shot, Few-Shot

Zero-shot: No examples. Just describe the task:

Classify the sentiment of this review:
Review: "I loved the movie."
Sentiment: ?

The model generates: positive (from its pre-trained knowledge of what “sentiment” means).

One-shot: One example:

Review: "I hated it."
Sentiment: negative

Review: "I loved it."
Sentiment: ?

The model generates: positive.

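Few-shot: a handful of examples, often 2–5 in everyday use (the GPT-3 paper typically used 10 to 100, as many as fit in the model’s 2,048-token context window):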

Review: "Terrible!"
Sentiment: negative

Review: "Amazing!"
Sentiment: positive

Review: "Not bad."
Sentiment: ?

The model generates: positive, or possibly neutral. More examples give the model a stronger signal about the expected labels and format.
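The three settings differ only in how many solved examples precede the query. A minimal sketch of that idea follows; the function name build_k_shot_prompt and the choice to add a task description only in the zero-shot case are assumptions made for illustration, not something the GPT-3 paper prescribes.

```python
from typing import List, Tuple

Demo = Tuple[str, str]  # (review text, sentiment label)


def build_k_shot_prompt(demos: List[Demo], query: str, k: int) -> str:
    """Return a prompt with the first k demonstrations followed by the query."""
    if k < 0 or k > len(demos):
        raise ValueError("k must be between 0 and len(demos)")
    blocks = [f'Review: "{review}"\nSentiment: {label}' for review, label in demos[:k]]
    # The query uses the same format, with the label left for the model to fill in.
    blocks.append(f'Review: "{query}"\nSentiment:')
    prompt = "\n\n".join(blocks)
    if k == 0:
        # Zero-shot: there are no demonstrations, so describe the task in words.
        prompt = "Classify the sentiment of this review:\n" + prompt
    return prompt


demos = [("Terrible!", "negative"), ("Amazing!", "positive")]
for k in (0, 1, 2):
    print(f"--- {k}-shot ---")
    print(build_k_shot_prompt(demos, "Not bad.", k))
```

The model and its weights are the same in all three cases; only the prompt string changes.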

Why Does This Work?

During pre-training, GPT-3 absorbed patterns from 300 billion tokens of text. These patterns include:

  • Linguistic structure: How sentences are built, grammar, word order.
  • Semantic knowledge: What words mean, what entities are, factual information.
  • Task patterns: How examples are formatted, what annotations look like.
  • Reasoning: Arithmetic, logic, analogies, code.

When you show examples in the prompt, you’re not teaching the model. You’re activating pre-existing knowledge.

It’s like asking someone who is fluent in both English and Hindi to translate ‘I love you’ into Hindi. Even without formal training as a translator, their deep understanding of both languages lets them produce a good answer (I love you ≈ Main tumhe pyaar karta hoon).

Pre-training is the deep knowledge. Prompt examples are the activation signal.

The Indian Analogy

Imagine a brilliant student who has read millions of books, articles, and documents in both Hindi and English. You ask them:

“Translate these sentences:

  • ‘Hello’ = ‘Namaste’
  • ‘Good morning’ = ‘Suprabhat’
  • ‘How are you?’ = ?”

The student has never formally studied translation, but from the examples and their vast reading, they infer the pattern (you’re translating English greetings to Hindi) and answer: “Kaise ho?” or “Aap kaise hain?”

That’s in-context learning. The student’s pre-training (reading millions of documents) unlocks the ability to learn from prompt context.

In contrast, fine-tuning is like enrolling the student in a translation course: give them labeled example pairs (English-Hindi sentence pairs), have them practice, test them, repeat. It works, but it’s slower and requires explicit training.

How GPT-3 Is Different from GPT-1

GPT-1 (117M parameters) had essentially the same architecture and training objective. So why didn’t GPT-1 have strong in-context learning?

Scale. With 117M parameters trained on the BooksCorpus dataset (roughly 7,000 unpublished books), GPT-1 didn’t absorb enough knowledge. Its zero-shot and few-shot performance was weak, so fine-tuning was still necessary.

With 175B parameters and 300B tokens, GPT-3 has absorbed vastly more knowledge. In-context learning emerges as a capability.

This isn’t a new mechanism—it’s the same transformer decoder. It’s the quantity of knowledge that changes the quality of inference.
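To make the scale gap concrete, here is a rough back-of-the-envelope sketch. It relies on a common rule of thumb, that a transformer decoder has about 12 × layers × width² non-embedding parameters. The rule and the function name approx_decoder_params are assumptions added for illustration; the layer counts and widths are the published configurations of the two models.

```python
def approx_decoder_params(n_layers: int, d_model: int) -> int:
    """Rule-of-thumb count of non-embedding parameters in a transformer decoder.

    Each layer has roughly 4 * d_model^2 attention weights and 8 * d_model^2
    feed-forward weights (assuming a 4x hidden expansion), hence the factor 12.
    Biases, layer norms, and embedding tables are ignored.
    """
    return 12 * n_layers * d_model ** 2


gpt1 = approx_decoder_params(n_layers=12, d_model=768)
gpt3 = approx_decoder_params(n_layers=96, d_model=12288)

print(f"GPT-1 estimate: {gpt1 / 1e6:.0f}M non-embedding parameters")
print(f"GPT-3 estimate: {gpt3 / 1e9:.0f}B non-embedding parameters")
print(f"Ratio: about {gpt3 / gpt1:.0f}x")
```

Under this approximation the estimates come out near 85M and 174B, a ratio of about 2,000x. The GPT-1 figure falls below the quoted 117M largely because the rule leaves out the embedding tables, which matter much more for a small model.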

The Format Matters: Prompt Engineering

The way you write the prompt affects the result. This is called prompt engineering, and it has become a skill in its own right.

Example 1 (bad prompt):

Review: great movie

Sentiment: positive

Review: bad food

Sentiment: negative

Classify this: nice book

Answer:

The model might output “positive” (correct) or get confused by the format.

Example 2 (good prompt):

Classify the sentiment of each review as positive, negative, or neutral.

Review: "The movie was absolutely wonderful!"
Sentiment: positive

Review: "The food was terrible."
Sentiment: negative

Review: "The book was okay, nothing special."
Sentiment: neutral

Review: "I loved the service!"
Sentiment:

The explicit instruction (“Classify the sentiment of each review as positive, negative, or neutral.”) and the consistent formatting make the pattern clearer, so GPT-3 is more likely to output “positive”.
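Consistent formatting can also be checked mechanically before a prompt is sent. The sketch below is one hypothetical way to do that: the function check_prompt, the regular expression, and the rule that every review must be quoted are assumptions chosen to match the examples in this section, not a standard tool.

```python
import re
from typing import List

# One block is a quoted review line followed by a label line. The label may be
# missing only in the final block, which the model is expected to complete.
BLOCK = re.compile(r'^Review: "(?P<review>[^"\n]+)"\nSentiment:(?: (?P<label>\w+))?$')


def check_prompt(prompt: str, labels: List[str]) -> List[str]:
    """Return a list of formatting problems; an empty list means consistent."""
    problems = []
    blocks = [b.strip() for b in prompt.strip().split("\n\n")]
    # Allow one leading instruction that is not a Review/Sentiment block.
    if blocks and not blocks[0].startswith("Review:"):
        blocks = blocks[1:]
    for i, block in enumerate(blocks):
        is_last = i == len(blocks) - 1
        match = BLOCK.match(block)
        if match is None:
            problems.append(f"block {i + 1} does not follow the Review/Sentiment format")
            continue
        label = match.group("label")
        if is_last and label is not None:
            problems.append("final block already contains a label")
        elif not is_last and label not in labels:
            problems.append(f"block {i + 1} has unknown label {label!r}")
    return problems


LABELS = ["positive", "negative", "neutral"]

GOOD = '''Classify the sentiment of each review as positive, negative, or neutral.

Review: "The movie was absolutely wonderful!"
Sentiment: positive

Review: "The food was terrible."
Sentiment: negative

Review: "I loved the service!"
Sentiment:'''

BAD = '''Review: great movie
Sentiment: positive

Review: bad food
Sentiment: negative

Classify this: nice book
Answer:'''

print("good prompt problems:", check_prompt(GOOD, LABELS))
print("bad prompt problems:", check_prompt(BAD, LABELS))
```

A check like this only catches surface inconsistencies. Whether a given format actually improves the model’s answers still has to be tested against the model itself.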

Example 3 (best prompt):

You are a sentiment analysis expert. Your task is to classify customer reviews.

Review: "The movie was absolutely wonderful!"
Sentiment: positive

Review: "The food was terrible."
Sentiment: negative

Review: "The book was okay, nothing special."
Sentiment: neutral

Review: "I loved the service!"
Sentiment:

Often better still: give the model a role (“sentiment analysis expert”), explicit instructions, and a consistent format, and leave the final label blank for the model to complete (here, “positive”). This is prompt engineering.

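Why Fine-Tuning Is Often No Longer Necessary

Fine-tuning updates weights to optimize for one task. But at sufficient scale, the model already “knows” many tasks implicitly, and prompt examples activate the right knowledge. In the GPT-3 paper, few-shot prompting is competitive with fine-tuned models on many benchmarks, though not on all of them.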

Think of it this way:

  • Fine-tuned BERT: A student enrolled in a 2-week intensive course on sentiment analysis. After the course, they’re excellent at sentiment analysis, but might struggle with translation or summarization.
  • Few-shot GPT-3: A student with so much background knowledge that when you give them 3 examples of sentiment classification, they activate their latent understanding and perform well. When you give them 3 examples of translation, they activate a different part of their knowledge. One student, many capabilities.

Emergent Abilities

Some things GPT-3 could do surprised everyone because the model was never explicitly trained on them:

  • Arithmetic: 7 + 4 = ? (No arithmetic training examples, but the model learned from text that contains arithmetic discussion and reasoning.)
  • Few-shot translation: Translate English to French with no explicit machine translation training (just examples in the prompt).
  • Code generation: “Write a Python function to reverse a string.” (No labeled code-generation dataset was used; the model picked this up from code that appeared in its pre-training text, such as programming Q&A pages and code snippets on the web.)
  • Multi-step reasoning: “If A > B and B > C, is A > C?” (Learned from logical reasoning in text, not from a logic dataset.)

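These are called emergent abilities: they emerge from scale, not from explicit task training. Smaller models such as GPT-1 showed them only weakly, if at all. At 175B parameters, they become strong enough to be useful, although still not fully reliable (GPT-3’s arithmetic accuracy, for example, drops as the numbers get longer).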
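One way to test a claim like “the model can do arithmetic from a few examples” is to build few-shot prompts automatically and score the completions by exact match. The sketch below shows such a harness under stated assumptions: the complete argument is a hypothetical stand-in for any function that maps a prompt to a text completion, and toy_complete simply computes the sum so that the harness can run without a real model.

```python
import random
from typing import Callable, List, Tuple

Problem = Tuple[int, int]


def make_problems(n: int, seed: int = 0) -> List[Problem]:
    """Draw n two-digit addition problems reproducibly."""
    rng = random.Random(seed)
    return [(rng.randint(10, 99), rng.randint(10, 99)) for _ in range(n)]


def addition_prompt(demos: List[Problem], a: int, b: int) -> str:
    """Format solved demonstrations followed by one unsolved problem."""
    blocks = [f"Q: What is {x} plus {y}?\nA: {x + y}" for x, y in demos]
    blocks.append(f"Q: What is {a} plus {b}?\nA:")
    return "\n\n".join(blocks)


def exact_match_accuracy(complete: Callable[[str], str],
                         demos: List[Problem],
                         problems: List[Problem]) -> float:
    """Fraction of problems where the completion equals the true sum."""
    correct = 0
    for a, b in problems:
        answer = complete(addition_prompt(demos, a, b)).strip()
        correct += answer == str(a + b)
    return correct / len(problems)


def toy_complete(prompt: str) -> str:
    """Stand-in for a real model: parses the last question and adds."""
    last_question = prompt.rsplit("Q: What is ", 1)[1]
    a, b = last_question.split("?")[0].split(" plus ")
    return f" {int(a) + int(b)}"


demos = make_problems(3, seed=1)
problems = make_problems(20, seed=2)
print(addition_prompt(demos, *problems[0]))
print("accuracy:", exact_match_accuracy(toy_complete, demos, problems))
```

Replacing toy_complete with a call to a real language model, and widening the number range in make_problems, would show how accuracy changes as the problems get harder.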


Key Takeaways from This Section

  • In-context learning is learning from examples in the prompt, not from weight updates.
  • Zero/one/few-shot refer to the number of examples in the prompt.
  • Pre-training unlocks knowledge; prompt examples activate it.
  • Scale is the key. GPT-1 couldn’t do this well; GPT-3 can.
  • Prompt engineering (how you write the prompt) affects output quality.
  • Emergent abilities (arithmetic, code, translation) appear at large scale without explicit training.
