The Idea: In-Context Learning
The core innovation of GPT-3 is in-context learning (ICL). The model learns from examples in the input prompt, without updating any weights.
What is In-Context Learning?
Instead of fine-tuning, you provide the task to the model in the prompt:
Example 1 (sentiment):
Review: "The movie was absolutely terrible."
Sentiment: negative
Example 2 (sentiment):
Review: "I loved every minute!"
Sentiment: positive
Example 3 (sentiment):
Review: "The ending was confusing."
Sentiment: ?
The model reads examples 1 and 2, infers the pattern (“the task is to classify reviews”), and generates the answer for example 3: negative.
No fine-tuning. No large labeled dataset. No retraining. Just put a few labeled examples in the prompt.
This is few-shot learning: learning from a few examples.
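One way to write this down, as a sketch in notation that is not from the original text: let (x_1, y_1), ..., (x_k, y_k) be the examples in the prompt, x_{k+1} the new input, and θ the pre-trained weights. Assuming greedy decoding, in-context learning picks the most likely continuation while θ stays fixed, whereas fine-tuning changes θ with gradient steps on a task loss (η and the loss L are likewise illustrative symbols).

```latex
% In-context learning: condition on the demonstrations, weights \theta unchanged
\hat{y}_{k+1} = \arg\max_{y} \; p_{\theta}\left(y \mid x_1, y_1, \dots, x_k, y_k, x_{k+1}\right)

% Fine-tuning, for contrast: the weights themselves are updated
\theta' = \theta - \eta \, \nabla_{\theta} \mathcal{L}(\theta)
```

In the sentiment prompt above, k = 2: the two labeled reviews are the demonstrations and the third review is x_{k+1}.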
Zero-Shot, One-Shot, Few-Shot
Zero-shot: No examples. Just describe the task:
Classify the sentiment of this review:
Review: "I loved the movie."
Sentiment: ?
The model generates: positive (from its pre-trained knowledge of what “sentiment” means).
One-shot: One example:
Review: "I hated it."
Sentiment: negative
Review: "I loved it."
Sentiment: ?
The model generates: positive.
Few-shot: A handful of examples, limited by what fits in the context window (the GPT-3 paper typically used 10 to 100):
Review: "Terrible!"
Sentiment: negative
Review: "Amazing!"
Sentiment: positive
Review: "Not bad."
Sentiment: ?
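The model generates: positive, or possibly neutral. More examples give the model more signal about the task and the expected output format, though neutral is less likely here because no example uses that label.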
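The three settings differ only in how many demonstrations are placed before the open query. A minimal sketch of that idea follows; the function name `build_prompt` and its parameters are hypothetical, chosen for illustration, and the code only builds prompt strings without calling any model.

```python
from typing import List, Tuple


def build_prompt(demos: List[Tuple[str, str]], query: str, k: int,
                 instruction: str = "") -> str:
    """Build a k-shot prompt: optional instruction, k demonstrations, open query."""
    if k < 0 or k > len(demos):
        raise ValueError("k must be between 0 and len(demos)")
    lines = []
    if instruction:
        lines.append(instruction)
    # Each demonstration uses the same Review/Sentiment layout as the text above.
    for review, label in demos[:k]:
        lines.append(f'Review: "{review}"')
        lines.append(f"Sentiment: {label}")
    # The final item leaves the label empty for the model to complete.
    lines.append(f'Review: "{query}"')
    lines.append("Sentiment:")
    return "\n".join(lines)


DEMOS = [("Terrible!", "negative"), ("Amazing!", "positive")]

zero_shot = build_prompt(DEMOS, "Not bad.", k=0,
                         instruction="Classify the sentiment of this review:")
one_shot = build_prompt(DEMOS, "Not bad.", k=1)
few_shot = build_prompt(DEMOS, "Not bad.", k=2)

print(few_shot)
```

Keeping the demonstrations in one list and varying only `k` makes it easy to compare the three settings on the same query.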
Why Does This Work?
During pre-training, GPT-3 absorbed patterns from 300 billion tokens of text. These patterns include:
- Linguistic structure: How sentences are built, grammar, word order.
- Semantic knowledge: What words mean, what entities are, factual information.
- Task patterns: How examples are formatted, what annotations look like.
- Reasoning: Arithmetic, logic, analogies, code.
When you show examples in the prompt, you’re not teaching the model. You’re activating pre-existing knowledge.
It’s like asking someone who grew up speaking both English and Hindi to translate ‘I love you’ into Hindi. They never took a translation course, but their deep understanding of both languages lets them answer correctly (I love you ≈ Main tumse pyaar karta hoon).
Pre-training is the deep knowledge. Prompt examples are the activation signal.
The Indian Analogy
Imagine a brilliant student who has read millions of books, articles, and documents in both Hindi and English. You ask them:
“Translate these sentences:
- ‘Hello’ = ‘Namaste’
- ‘Good morning’ = ‘Suprabhat’
- ‘How are you?’ = ?”
The student has never formally studied translation, but from the examples and their vast reading, they infer the pattern (you’re translating English greetings to Hindi) and answer: “Kaise ho?” or “Aap kaise hain?”
That’s in-context learning. The student’s pre-training (reading millions of documents) unlocks the ability to learn from prompt context.
In contrast, fine-tuning is like enrolling the student in a translation course: give them labeled example pairs (English-Hindi sentence pairs), have them practice, test them, repeat. It works, but it’s slower and requires explicit training.
How GPT-3 Is Different from GPT-1
GPT-1 (117M parameters) used the same basic architecture (a decoder-only transformer) and the same next-token prediction objective. So why didn’t GPT-1 have strong in-context learning?
Scale. With 117M parameters and roughly 5GB of training text (the BooksCorpus dataset), GPT-1 didn’t absorb enough knowledge. Its zero-shot and few-shot performance was weak. Fine-tuning was still necessary.
With 175B parameters and 300B tokens, GPT-3 has absorbed vastly more knowledge. In-context learning emerges as a capability.
This isn’t a new mechanism: it’s the same transformer decoder design. What changes is the quantity of knowledge the model has absorbed, and with it the quality of inference.
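To get a feel for the size of the jump, the figures quoted above can be turned into ratios. This is only back-of-the-envelope arithmetic on the numbers stated in this section, not a measurement of capability.

```python
# Figures quoted in this section.
GPT1_PARAMS = 117e6   # 117M parameters
GPT3_PARAMS = 175e9   # 175B parameters
GPT3_TOKENS = 300e9   # 300B training tokens

# How many times larger GPT-3 is than GPT-1, by parameter count.
param_ratio = GPT3_PARAMS / GPT1_PARAMS

# Training tokens seen per parameter for GPT-3.
tokens_per_param = GPT3_TOKENS / GPT3_PARAMS

print(f"GPT-3 has about {param_ratio:,.0f}x the parameters of GPT-1")
print(f"GPT-3 saw about {tokens_per_param:.2f} training tokens per parameter")
```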
The Format Matters: Prompt Engineering
The way you write the prompt affects the result. Writing prompts well is called prompt engineering, and it is a skill in its own right.
Example 1 (bad prompt):
Review: great movie
Sentiment: positive
Review: bad food
Sentiment: negative
Classify this: nice book
Answer:
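The model might output “positive” (correct), but the last item breaks the pattern: the examples use “Review:” and “Sentiment:”, while the query switches to “Classify this:” and “Answer:”. That inconsistency can confuse the model.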
Example 2 (good prompt):
Classify the sentiment of each review as positive, negative, or neutral.
Review: "The movie was absolutely wonderful!"
Sentiment: positive
Review: "The food was terrible."
Sentiment: negative
Review: "The book was okay, nothing special."
Sentiment: neutral
Review: "I loved the service!"
Sentiment:
The explicit instruction (“Classify the sentiment as positive, negative, or neutral”) and consistent formatting make the pattern clearer. GPT-3 is more likely to output “positive”.
Example 3 (best prompt):
You are a sentiment analysis expert. Your task is to classify customer reviews as positive, negative, or neutral.
Review: "The movie was absolutely wonderful!"
Sentiment: positive
Review: "The food was terrible."
Sentiment: negative
Review: "The book was okay, nothing special."
Sentiment: neutral
Review: "I loved the service!"
Sentiment:
Even better: give the model a role (“sentiment analysis expert”), explicit instructions, and a consistent format. This is prompt engineering.
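The formatting advice above can be captured in a small template so that every example and the final query share exactly the same layout. The sketch below is illustrative: `format_prompt`, `parse_label`, and the constants are hypothetical names, and the parsing step (mapping the raw completion back onto the allowed labels) is an added assumption about how one might consume the output, not something the text prescribes.

```python
from typing import List, Optional, Tuple

LABELS = ("positive", "negative", "neutral")
INSTRUCTION = ("Classify the sentiment of each review as "
               "positive, negative, or neutral.")


def format_prompt(demos: List[Tuple[str, str]], query: str) -> str:
    """Instruction first, then identically formatted examples, then the query."""
    lines = [INSTRUCTION]
    for review, label in demos:
        if label not in LABELS:
            raise ValueError(f"unknown label: {label}")
        lines.append(f'Review: "{review}"')
        lines.append(f"Sentiment: {label}")
    lines.append(f'Review: "{query}"')
    lines.append("Sentiment:")  # left open for the model to fill in
    return "\n".join(lines)


def parse_label(completion: str) -> Optional[str]:
    """Return the first word of the completion if it is an allowed label."""
    words = completion.strip().split()
    if not words:
        return None
    first = words[0].strip('.,!?"\'').lower()
    return first if first in LABELS else None


DEMOS = [
    ("The movie was absolutely wonderful!", "positive"),
    ("The food was terrible.", "negative"),
    ("The book was okay, nothing special.", "neutral"),
]

prompt = format_prompt(DEMOS, "I loved the service!")
print(prompt)
```

Generating the prompt from one template rules out the kind of format drift seen in the bad prompt, and restricting the parsed answer to `LABELS` makes unexpected completions easy to detect.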
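Why Fine-Tuning Is Often No Longer Necessary
Fine-tuning updates weights to optimize for a task. But if you scale enough, the model already “knows” many tasks implicitly, and prompt examples activate the right knowledge. A fine-tuned model can still score higher on a specific benchmark, but for many tasks the few-shot result is good enough to skip training altogether.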
Think of it this way:
- Fine-tuned BERT: A student enrolled in a 2-week intensive course on sentiment analysis. After the course, they’re excellent at sentiment analysis, but might struggle with translation or summarization.
- Few-shot GPT-3: A student with so much background knowledge that when you give them 3 examples of sentiment classification, they activate their latent understanding and perform well. When you give them 3 examples of translation, they activate a different part of their knowledge. One student, many capabilities.
Emergent Abilities
Some things GPT-3 could do surprised everyone because the model was never explicitly trained on them:
- Arithmetic: 7 + 4 = ? (No dedicated arithmetic dataset, but the model learned from text that contains arithmetic and worked calculations.)
- Few-shot translation: Translate English to French with no explicit machine translation training (just examples in the prompt).
- Code generation: “Write a Python function to reverse a string.” (No labeled code-generation dataset used, but the model learned from code that appeared in its web-scale pre-training text, for example on pages from sites such as StackOverflow and GitHub.)
- Multi-step reasoning: “If A > B and B > C, is A > C?” (Learned from logical reasoning in text, not from a logic dataset.)
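These are called emergent abilities: they emerge from scale, not from explicit task training. In smaller models such as GPT-1 they were too weak to be useful. At 175B parameters, they become strong enough to be useful in practice, although they are not perfectly reliable (for example, arithmetic accuracy drops as the numbers get longer).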
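One simple way to check a claimed ability such as few-shot arithmetic is to build prompts from a few solved problems and measure exact-match accuracy on new ones. The sketch below assumes a hypothetical `complete` callable that maps a prompt string to the model's text continuation; `toy_complete` is a stand-in that actually computes the sum so the example runs without any model, and all function names are illustrative.

```python
from typing import Callable, List, Tuple

Problem = Tuple[int, int]


def make_arithmetic_prompt(demos: List[Problem], a: int, b: int) -> str:
    """Few-shot addition prompt: solved examples, then one open problem."""
    lines = [f"{x} + {y} = {x + y}" for x, y in demos]
    lines.append(f"{a} + {b} =")
    return "\n".join(lines)


def exact_match_accuracy(complete: Callable[[str], str],
                         demos: List[Problem],
                         problems: List[Problem]) -> float:
    """Fraction of problems where the first generated token is the right sum."""
    if not problems:
        raise ValueError("problems must not be empty")
    correct = 0
    for a, b in problems:
        completion = complete(make_arithmetic_prompt(demos, a, b))
        tokens = completion.strip().split()
        if tokens and tokens[0] == str(a + b):
            correct += 1
    return correct / len(problems)


def toy_complete(prompt: str) -> str:
    """Stand-in for a real model (assumption): adds the numbers on the last line."""
    last = prompt.splitlines()[-1]               # e.g. "7 + 4 ="
    left, right = last.rstrip("= ").split("+")
    return f" {int(left) + int(right)}"


demos = [(2, 3), (10, 5)]
problems = [(7, 4), (12, 30), (99, 1)]
print(exact_match_accuracy(toy_complete, demos, problems))
```

Swapping `toy_complete` for a call to a real model would turn this into a small benchmark; running it with operands of increasing length is one way to see where the ability stops being dependable.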
Key Takeaways from This Section
- In-context learning is learning from examples in the prompt, not from weight updates.
- Zero/one/few-shot refer to the number of examples in the prompt.
- Pre-training unlocks knowledge; prompt examples activate it.
- Scale is the key. GPT-1 couldn’t do this well; GPT-3 can.
- Prompt engineering (how you write the prompt) affects output quality.
- Emergent abilities (arithmetic, code, translation) appear at large scale without explicit training.
Next: Section 04: The Math