Section 09

Summary: One-Page Recap

Language Models are Few-Shot Learners (2020)

Summary: GPT-3 at a Glance

One-Sentence Version

GPT-3 proved that scaling a transformer language model to 175 billion parameters enables in-context learning: the model learns new tasks from examples in the prompt, without any fine-tuning.


The Problem

By 2020, NLP relied on fine-tuning: pre-train a model, then train it on task-specific labeled data for each new task. This required:

  • Expensive labeling (1,000–10,000 examples per task)
  • Separate models for each task
  • Retraining when the data distribution shifted

Fine-tuning scaled poorly for companies with many tasks.


The Key Ideas

  1. In-context learning (ICL): Provide task examples in the prompt; the model learns from context alone, with no weight updates.

  2. Zero/one/few-shot: Depending on the number of examples in the prompt:

    • Zero-shot: No examples, just describe the task
    • One-shot: One example
    • Few-shot: A handful of examples, as many as fit in the context window (the paper typically used 10–100)

  3. Scale unlocks capability: At 175B parameters, the model shows abilities that are weak or absent in much smaller models (for example, the 125M and 350M models trained in the same study). This emergence of new capabilities at scale is a central insight.

  4. One model, many tasks: Instead of task-specific fine-tuned models, one large model handles many tasks by adapting to the prompt.

  5. Prompt engineering matters: The exact wording and format of the prompt affects output quality significantly.
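
The zero-, one-, and few-shot settings differ only in how many worked examples precede the query. A minimal Python sketch of that prompt construction follows. The function name `build_prompt`, the `Review:`/`Sentiment:` template, and the sample reviews are illustrative assumptions, not the format used in the paper.

```python
def build_prompt(task_description, examples, query):
    """Assemble a prompt with k in-context examples (k = len(examples)).

    k = 0 -> zero-shot, k = 1 -> one-shot, k > 1 -> few-shot.
    The template below is a hypothetical format chosen for illustration.
    """
    lines = [task_description, ""]
    for text, label in examples:
        lines.append(f"Review: {text}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    # The query uses the same template with the answer left blank,
    # so the model's most natural continuation is the label.
    lines.append(f"Review: {query}")
    lines.append("Sentiment:")
    return "\n".join(lines)


demos = [
    ("The plot was gripping from start to finish.", "positive"),
    ("I walked out halfway through.", "negative"),
]

task = "Classify the sentiment of each review."
zero_shot = build_prompt(task, [], "Great acting.")
few_shot = build_prompt(task, demos, "Great acting.")
```

Because the only difference between the two prompts is the list of demonstrations, switching tasks means swapping the description and examples, not retraining anything.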


Key Numbers

Metric                  Value
Parameters              175 billion
Layers                  96
Attention heads         96
Hidden dimension        12,288
Training tokens         300 billion
Training data sources   Common Crawl, WebText2, Books (Books1, Books2), Wikipedia
Training compute        ~3,640 petaflop/s-days
Training cost           ~$5–10 million USD (outside estimate; not reported in the paper)
Context window          2,048 tokens
Vocabulary size         50,257 tokens
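
Several of these numbers can be cross-checked against each other. The sketch below uses two common rules of thumb, both approximations rather than exact accounting: roughly 12 × d_model² weights per transformer layer, and roughly 6 floating-point operations per parameter per training token. The variable names are illustrative.

```python
# GPT-3 175B hyperparameters as listed in the table above.
n_layers = 96
d_model = 12_288
n_heads = 96
vocab_size = 50_257
n_ctx = 2_048
train_tokens = 300e9

# Approximation: each layer holds about 12 * d_model^2 weights
# (4 * d^2 for attention projections, 8 * d^2 for the feed-forward block),
# ignoring biases and layer norms.
layer_params = 12 * n_layers * d_model**2
# Token embeddings plus learned position embeddings.
embedding_params = (vocab_size + n_ctx) * d_model
total_params = layer_params + embedding_params

# Each attention head works in a slice of the hidden dimension.
head_dim = d_model // n_heads

# Approximation: training costs about 6 FLOPs per parameter per token.
train_flops = 6 * total_params * train_tokens
petaflop_s_day = 1e15 * 86_400  # FLOPs in one petaflop/s sustained for a day
train_pf_days = train_flops / petaflop_s_day

print(f"parameters ~ {total_params / 1e9:.1f}B")
print(f"head dimension = {head_dim}")
print(f"training compute ~ {train_pf_days:,.0f} petaflop/s-days")
```

The estimate lands just under 175 billion parameters, with a head dimension of 128 and training compute in the neighborhood of 3,600 petaflop/s-days, which is the unit the paper uses when reporting compute.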

The Math (Brief)

Objective: Minimize cross-entropy loss on causal language modeling.

L = -(1/N) * Σ_{i=1..N} log P(u_i | u_1, ..., u_{i-1})

where u_i is the i-th token and N is the number of tokens in the sequence.

Same as GPT-1. The innovation is scale, not mathematics.
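
To make the objective concrete, here is a minimal Python sketch that evaluates the loss for a short sequence. The callback `next_token_probs` is a hypothetical stand-in for the model, and the four-word vocabulary is a toy assumption; a real implementation would work on logits from a neural network.

```python
import math


def causal_lm_loss(tokens, next_token_probs):
    """Average negative log-likelihood of each token given its prefix.

    `next_token_probs(prefix)` is a hypothetical stand-in for the model:
    it returns a dict mapping each candidate token to P(token | prefix).
    """
    total = 0.0
    for i, token in enumerate(tokens):
        probs = next_token_probs(tokens[:i])  # condition on u_1 .. u_{i-1}
        total += math.log(probs[token])
    return -total / len(tokens)


def uniform_model(prefix):
    # Toy model that ignores the prefix and spreads probability evenly.
    vocab = ["the", "cat", "sat", "down"]
    return {tok: 1.0 / len(vocab) for tok in vocab}


loss = causal_lm_loss(["the", "cat", "sat"], uniform_model)
print(loss)  # equals log(4) for a uniform model over four tokens
```

A model that has learned nothing scores log(vocabulary size); training pushes the loss below that by putting more probability on the token that actually comes next.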

How it works:

  • Pre-training: Learn to predict the next token from all previous tokens
  • Inference (few-shot): Provide task examples in the prompt
  • The model’s attention mechanisms recognize the task pattern and apply it to new examples
  • No weight updates; all learning is in-context
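
The steps above can be sketched as a decoding loop around a frozen model. In this toy Python example, `toy_model` is a hypothetical stand-in that copies an answer from an earlier matching line in the prompt. It is a hand-written rule, not how GPT-3 works internally, but it shows the key property: the function never changes, yet its output depends on the examples in the context.

```python
def generate(prompt_tokens, next_token_probs, max_new_tokens=5, stop="\n"):
    """Greedy autoregressive decoding with a frozen model.

    `next_token_probs(prefix)` is only called, never modified,
    so nothing is learned between calls.
    """
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)
        best = max(probs, key=probs.get)  # most likely next token
        if best == stop:
            break
        tokens.append(best)
    return tokens[len(prompt_tokens):]


def toy_model(prefix):
    # If the prefix ends with "<x> ->", look for an earlier "<x> -> <y>"
    # in the context and predict <y>; otherwise end the line.
    if len(prefix) >= 2 and prefix[-1] == "->":
        key = prefix[-2]
        for i in range(len(prefix) - 4):
            if prefix[i] == key and prefix[i + 1] == "->":
                return {prefix[i + 2]: 0.9, "\n": 0.1}
    return {"\n": 1.0}


prompt_a = ["sky", "->", "blue", "\n", "grass", "->", "green", "\n", "sky", "->"]
prompt_b = ["sky", "->", "grey", "\n", "sky", "->"]

completion_a = generate(prompt_a, toy_model)
completion_b = generate(prompt_b, toy_model)
```

The same frozen function completes the first prompt with `blue` and the second with `grey`, because the only thing that differs is the context it was given.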

The Indian Analogy

A brilliant student with deep knowledge from reading millions of books. You show them 3–5 examples of a new task (say, sentiment classification), and without any formal training, they figure out the pattern and apply it. The examples activate latent knowledge.

In contrast, traditional fine-tuning is like enrolling the student in a training course: you give them labeled examples, they practice, you test them, they improve. It works but is slower.


What It Could Do

  • Sentiment analysis: Classify text as positive/negative/neutral with few-shot examples
  • Translation: Translate between languages with a few examples (no MT training)
  • Arithmetic: Solve simple math problems such as adding small numbers (accuracy drops as the numbers get longer)
  • Code generation: Write short programs from English descriptions
  • Q&A: Answer questions with in-context knowledge
  • Summarization: Summarize text (with varying quality)
  • Reasoning: Multi-step logic (weak, but present)

What It Struggled With

  • Factual accuracy: Hallucinations (generating plausible but false information)
  • Complex reasoning: Multi-step logic problems
  • Prompt sensitivity: Small wording changes cause different outputs
  • Learning from feedback: Weights are fixed at inference, so nothing learned from one prompt carries over to the next
  • Limited context: Only attends to ~2,000 tokens at a time
  • Cost: Expensive to train and run
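
The context limit puts a hard cap on how many few-shot examples a prompt can carry. The following Python sketch shows one simple way to budget for it. The whitespace-based `count_tokens` is a crude assumption (GPT-3 uses byte-pair encoding, which usually produces more tokens than words), and `fit_examples` and `reserve_for_answer` are hypothetical names.

```python
def count_tokens(text):
    # Crude assumption: one token per whitespace-separated word.
    return len(text.split())


def fit_examples(examples, query, budget=2048, reserve_for_answer=50):
    """Keep as many leading examples as fit alongside the query."""
    remaining = budget - reserve_for_answer - count_tokens(query)
    kept = []
    for example in examples:
        cost = count_tokens(example)
        if cost > remaining:
            break  # the next example would overflow the context window
        kept.append(example)
        remaining -= cost
    return kept


examples = ["word " * 100] * 30  # thirty 100-word demonstrations
query = "word " * 48             # a 48-word query
kept = fit_examples(examples, query)
print(len(kept))
```

With these toy sizes only 19 of the 30 demonstrations fit, which is why longer tasks end up with fewer in-context examples.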

What Changed Because of GPT-3

  1. ChatGPT (2022): Fine-tuned a GPT-3.5-series model (a GPT-3 successor) for conversation → mainstream AI adoption
  2. Copilot (2021): Code generation with Codex (GPT-3 fine-tune)
  3. Scaling focus: Much of the field turned to building larger models and studying scaling laws
  4. Prompt engineering: A new discipline emerged
  5. API-first business model: OpenAI monetized via API access
  6. Open-source alternatives: BLOOM, LLaMA, Mistral emerged to compete
  7. Safety research: Alignment and truthfulness became urgent
  8. Industry adoption: Thousands of startups built on GPT-3

Key Related Papers

  • InstructGPT (Ouyang et al., 2022): Fine-tuned GPT-3 with human feedback
  • ChatGPT (OpenAI, 2022): A sibling model to InstructGPT, released as a public chat interface (announced in a blog post, not a paper)
  • Scaling Laws for Neural Language Models (Kaplan et al., 2020): Published shortly before GPT-3; studied how loss falls predictably with model size, data, and compute (see Paper 13)
  • Chain-of-Thought Prompting (Wei et al., 2022): Improved reasoning by prompting the model with step-by-step worked examples
  • Constitutional AI (Bai et al., 2022): Trained a model for harmlessness using AI feedback guided by written principles rather than human labels
  • LLaMA (Touvron et al., 2023): Openly released models competitive with GPT-3

In this series:

  • Paper 13: Scaling Laws for Neural Language Models — Why GPT-3 works: the math of how performance scales with parameters and data
  • Paper 14: Chain-of-Thought Prompting — How to make GPT-3 reason better
  • Paper 15: InstructGPT — How to fine-tune GPT-3 to follow instructions better

Bottom Line

GPT-3 proved that scale is the primary lever for AI capability. A single 175-billion-parameter model, trained on diverse text, can do dozens of tasks without fine-tuning, just from examples in the prompt. This insight shaped everything that followed in large language models: ChatGPT, GPT-4, Claude, Gemini, and the entire modern LLM ecosystem.

The paradigm shifted from “fine-tune for each task” to “prompt one giant model.” The implications are still unfolding.

