Paper 15
Intermediate

Training Language Models to Follow Instructions with Human Feedback

Ouyang, Wu, Jiang, Almeida, Wainwright, Mishkin, Zhang, Agarwal, Slama, Ray, Schulman, Hilton, Kelton, Miller, Simens, Askell, Welinder, Christiano, Leike, Lowe · NeurIPS 2022
Read on arXiv


What This Paper Did

GPT-3 is powerful but misaligned. It is trained to model the probability distribution of internet text, not to do what users actually want. Ask it to help with something, and it might refuse, ramble, hallucinate, or follow a harmful instruction, with no built-in preference for the helpful response.

This paper addresses the problem with a three-stage pipeline:

  1. Supervised Fine-Tuning (SFT): Train the model on human-written demonstration answers
  2. Reward Model (RM): Train a classifier to predict which outputs humans prefer
  3. Reinforcement Learning (RL): Use the reward model to optimize the policy with PPO (Proximal Policy Optimization)

The result is InstructGPT: a 1.3 billion parameter model whose outputs human raters prefer over those of the 175 billion parameter GPT-3, despite having over 100× fewer parameters.

This pipeline is the direct precursor of ChatGPT and shaped how later assistants such as Claude are aligned.

Key results:

Model                    | Rating (1–5 scale, human raters)
-------------------------|---------------------------------
GPT-3 (175B)             | 1.54 (baseline)
InstructGPT (1.3B)       | 4.53 (best)

Head-to-head: InstructGPT outputs were preferred over GPT-3 outputs 72% of the time.

The small model beats the giant because, on instruction-following prompts, alignment training matters more to human raters than raw parameter count.

Key concepts:

  • Alignment: Making models do what humans want, not just what’s probable
  • RLHF: Reinforcement Learning from Human Feedback
  • Reward Model: Learns human preferences from comparison data
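  • PPO: A policy-gradient RL algorithm that limits how far each update can move the policy, which keeps training stable
  • KL Divergence Penalty: Keeps the RL model close to the SFT model, so it cannot drift into outputs that merely exploit flaws in the reward model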
  • HHH: Helpful, Harmless, Honest — the alignment criteria
  • Three-stage pipeline: SFT → RM → PPO

What This Paper Did (Technical Overview)

The three-stage RLHF pipeline:

Stage 1: Supervised Fine-Tuning (SFT)

  • Collect human demonstrations: prompts + ideal answers
  • Fine-tune GPT-3 on these examples
  • Result: SFT model that follows instructions better than base GPT-3

Stage 2: Reward Model (RM)

  • Collect human preference comparisons: prompt + output A + output B → human rates which is better
  • Train a classifier to predict human preferences
  • Bradley-Terry model: models the probability that output A is better than output B
  • Result: reward model that mimics human judgment
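
In the paper, labelers rank several outputs per prompt rather than only two, and each ranking is expanded into pairwise comparisons for the reward model. A minimal sketch of that expansion, assuming the ranking arrives as a best-first list (the function name and tuple layout are hypothetical, not the authors' code):

```python
from itertools import combinations

def ranking_to_pairs(prompt, ranked_outputs):
    """Expand one best-first ranking into (prompt, preferred, rejected) tuples.

    combinations() keeps input order, so the first element of every pair
    is always ranked above the second.
    """
    return [(prompt, better, worse)
            for better, worse in combinations(ranked_outputs, 2)]

# Hypothetical example: four outputs ranked best to worst by one labeler.
pairs = ranking_to_pairs("Explain gravity to a child", ["out_1", "out_2", "out_3", "out_4"])
print(len(pairs))  # K outputs give K * (K - 1) / 2 comparisons
```

Each tuple then becomes one training example for the Bradley-Terry loss.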

Stage 3: Reinforcement Learning (RL)

  • Use PPO to optimize the SFT model against the reward model
  • KL divergence penalty prevents diverging too far from SFT
  • Objective: maximize reward while staying close to SFT baseline
  • Result: InstructGPT model aligned with human preferences
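
One common way to realize "maximize reward while staying close to SFT" is to subtract a KL estimate from the reward model's score for each sampled response. The sketch below assumes the log-probabilities of the sampled response under both models are already available. The function name and default `beta` are illustrative, not taken from the paper's code.

```python
def kl_shaped_reward(rm_score, logp_rl, logp_sft, beta=0.01):
    """Reward seen by the RL step for one sampled response.

    rm_score : scalar score from the reward model
    logp_rl  : log-probability of the response under the current RL policy
    logp_sft : log-probability of the same response under the frozen SFT model
    beta     : strength of the KL penalty
    """
    log_ratio = logp_rl - logp_sft  # single-sample estimate of the KL term
    return rm_score - beta * log_ratio

# A response the RL policy now likes much more than the SFT model did
# keeps less of its reward-model score than one both models agree on.
print(kl_shaped_reward(1.0, logp_rl=-5.0, logp_sft=-20.0, beta=0.1))
print(kl_shaped_reward(1.0, logp_rl=-5.0, logp_sft=-5.0, beta=0.1))
```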

The Indian Analogy: Three Teachers

Imagine training a brilliant but undisciplined student for an exam:

Stage 1: Learning from a master teacher (SFT)

  • A master teacher (human expert) shows the student how to solve problems correctly
  • The student learns by imitation: “When you see this type of question, do this”
  • Student becomes much better but hasn’t internalized why this is the right way

Stage 2: Learning from a coach (Reward Model)

  • Now bring in a coach (not the master teacher, but an assistant)
  • The coach watches two attempts at the same problem
  • The coach judges: “Attempt A is better because it’s clearer and more efficient”
  • The coach rates dozens of pairs of attempts
  • The student learns the coach’s judgment pattern

Stage 3: Practicing with feedback (RL)

  • The student practices solving problems on their own
  • After each attempt, the coach scores it (that’s the reward)
  • The student adjusts their strategy to get higher scores from the coach
  • But here’s the catch: the coach is imperfect (learned from limited examples)
  • So the student doesn’t abandon the original master teacher’s lessons entirely
  • The student balances: follow the coach’s guidance, but stay somewhat true to the master’s principles

Balancing the coach’s scores against the KL divergence penalty avoids two failure modes:

  1. The student forgets the master’s teachings and only optimizes for the coach’s scores
  2. The student ignores the coach and doesn’t improve

The Math Overview

Stage 1 (SFT): Standard supervised learning

Minimize cross-entropy loss: $$L_{\text{SFT}} = -\log \pi_{\text{SFT}}(y | x)$$

Where $\pi_{\text{SFT}}$ is the SFT model’s probability of generating correct response $y$ given prompt $x$.
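
For an autoregressive model, $\log \pi_{\text{SFT}}(y | x)$ is the sum of the log-probabilities of each token of $y$, so the loss can be computed token by token. A small sketch with made-up token probabilities (the numbers are illustrative only):

```python
import math

def sft_loss(token_probs):
    """Negative log-likelihood of one demonstration, summed over its tokens.

    token_probs[t] is the probability the model assigns to the t-th token
    of the demonstration, given the prompt and the previous tokens.
    """
    return -sum(math.log(p) for p in token_probs)

# Hypothetical per-token probabilities for the same demonstration
# under a well-fit model and a poorly fit one.
confident = [0.9, 0.8, 0.95]
unsure = [0.3, 0.2, 0.5]
print(sft_loss(confident), sft_loss(unsure))
```

The loss is zero only when every token of the demonstration gets probability 1, and it grows as the model spreads probability elsewhere.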

Stage 2 (Reward Model): Bradley-Terry preference model

Loss for a single preference comparison (A is better than B): $$L_{\text{RM}} = -\log \sigma(r_\theta(x, y_A) - r_\theta(x, y_B))$$

Where:

  • $r_\theta(x, y)$ is the reward for (prompt, response) pair
  • $\sigma$ is the sigmoid function
  • We want $r_\theta(x, y_A) > r_\theta(x, y_B)$ when A is preferred
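
The pairwise loss above depends only on the difference between the two rewards. A minimal sketch in plain Python, using a numerically stable log-sigmoid (the function names are illustrative):

```python
import math

def log_sigmoid(z):
    """log(sigmoid(z)) computed without overflow for large |z|."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def rm_loss(reward_preferred, reward_rejected):
    """Bradley-Terry loss for one comparison where the first output was preferred."""
    return -log_sigmoid(reward_preferred - reward_rejected)

# Equal rewards mean the model is undecided; the loss equals log(2).
print(rm_loss(0.0, 0.0))
# Ranking the pair correctly lowers the loss, ranking it wrongly raises it.
print(rm_loss(2.0, 0.0), rm_loss(0.0, 2.0))
```

Because only the difference matters, adding a constant to every reward leaves the loss unchanged. The reward scale is therefore arbitrary until it is normalized.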

Stage 3 (RL): PPO with KL penalty

$$L_{\text{RL}} = -E[r_\theta(x, y)] + \beta \cdot KL[\pi_{\text{RL}}(y|x) || \pi_{\text{SFT}}(y|x)]$$

Where:

  • $r_\theta$ is the reward model
  • $\beta$ is the KL penalty coefficient (e.g., 0.01)
  • The KL term prevents the RL policy from diverging too far from SFT
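
It can help to see why the KL penalty is often described as a per-sample adjustment to the reward. Writing the expectation over responses sampled from the RL policy, the KL term is itself an expectation of a log-ratio:

$$KL[\pi_{\text{RL}}(y|x) || \pi_{\text{SFT}}(y|x)] = E_{y \sim \pi_{\text{RL}}}\left[\log \frac{\pi_{\text{RL}}(y|x)}{\pi_{\text{SFT}}(y|x)}\right]$$

Substituting this into $L_{\text{RL}}$ and collecting both terms under one expectation gives an equivalent form:

$$L_{\text{RL}} = -E_{y \sim \pi_{\text{RL}}}\left[r_\theta(x, y) - \beta \log \frac{\pi_{\text{RL}}(y|x)}{\pi_{\text{SFT}}(y|x)}\right]$$

Read this way, each sampled response earns its reward-model score minus $\beta$ times how much more likely the RL policy makes it than the SFT model did.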

Key Equations at a Glance

SFT Loss:
  L_sft = -log π_sft(y | x)

Reward Model Loss (Bradley-Terry):
  L_rm = -log σ(r_θ(x, y_w) - r_θ(x, y_l))
  where y_w is the preferred output and y_l the rejected one (y_A and y_B above),
  and σ(z) = 1 / (1 + exp(-z)) [sigmoid]

RL Loss (PPO with KL penalty):
  L_rl = -E[r_θ(x, y)] + β · KL[π_rl(y|x) || π_sft(y|x)]

KL Divergence:
  KL[P || Q] = Σ_y P(y) log(P(y) / Q(y))
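
The KL formula can be checked on small discrete distributions. This sketch assumes both distributions are lists over the same outcomes and that Q is nonzero wherever P is:

```python
import math

def kl_divergence(p, q):
    """KL[P || Q] for two discrete distributions given as equal-length lists.

    Terms with P(y) == 0 contribute nothing, following the usual convention.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical distributions over two possible responses.
sft_policy = [0.5, 0.5]
rl_policy = [0.9, 0.1]
print(kl_divergence(rl_policy, sft_policy))  # penalty direction used in RLHF
print(kl_divergence(sft_policy, rl_policy))  # the reverse direction differs
```

KL is zero only when the two distributions match, and it is not symmetric, which is why the order of the arguments in the RL loss matters.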

Read in This Order

Section             | What You Will Learn                                                        | Difficulty      | Time
--------------------|----------------------------------------------------------------------------|-----------------|-------
01 - Context        | Why alignment matters; the gap between capability and alignment            | 🟢 beginner     | 8 min
02 - The Problem    | What’s wrong with training language models on internet text                | 🟡 intermediate | 7 min
03 - The Idea       | Three-stage RLHF pipeline; intuition behind each stage                     | 🟡 intermediate | 12 min
04 - The Math       | SFT loss, Bradley-Terry reward model, PPO objective with KL penalty        | 🔴 advanced     | 12 min
05 - Worked Example | Step-by-step trace of reward model training on concrete preference pairs   | 🟡 intermediate | 8 min
06 - The Code       | Implementing reward model training and PPO-style updates                   | 🔴 advanced     | 10 min
07 - Limitations    | When RLHF fails; reward hacking; data requirements                         | 🟡 intermediate | 6 min
08 - Impact         | ChatGPT, Claude, modern alignment methods                                  | 🟡 intermediate | 5 min
09 - Summary        | One-sentence recap, key numbers, what came next                            | 🟢 beginner     | 3 min

Before You Read: Math You Need


Architecture: The RLHF Pipeline

STAGE 1: SUPERVISED FINE-TUNING (SFT)
    GPT-3 (175B)
         |
    [Human demonstrations]
    (prompt, ideal_answer) pairs
         |
    Fine-tune via SFT loss
         |
    π_SFT: SFT Model
    (better at following instructions)


STAGE 2: REWARD MODEL TRAINING
    [Human comparisons]
    (prompt, response_A, response_B) + label A or B
         |
    Train classifier: is A better than B?
         |
    r_θ: Reward Model
    (predicts human preference)


STAGE 3: REINFORCEMENT LEARNING (PPO)
    π_SFT → π_RL (via policy gradient)
    
    For each rollout:
      1. Generate response from π_RL
      2. Get reward from r_θ
      3. Compute RL gradient: maximize reward - KL penalty
      4. Update π_RL via PPO

    Result: InstructGPT
    (aligned with human preferences, smaller than GPT-3 but better)


ALIGNMENT EFFECT:
    GPT-3 (175B):           1.54 rating
    InstructGPT (1.3B):     4.53 rating  ← over 100× fewer parameters, about 3× the rating
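
The Stage 3 loop in the diagram can be illustrated on a toy problem small enough to solve exactly. The sketch below is not PPO: it replaces sampling and clipped updates with exact gradient ascent on the same reward-minus-KL objective, for a softmax policy over three candidate responses. The rewards, step size, and function names are all made up for illustration.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def objective(policy, rewards, reference, beta):
    """Expected reward minus beta * KL[policy || reference]."""
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    return expected_reward - beta * kl(policy, reference)

def train(rewards, reference, beta, steps=500, lr=0.5):
    """Gradient ascent on the objective, starting from the reference (SFT) policy."""
    logits = [math.log(q) for q in reference]
    for _ in range(steps):
        policy = softmax(logits)
        # Per-response signal: reward minus the KL log-ratio penalty.
        signal = [r - beta * math.log(p / q)
                  for r, p, q in zip(rewards, policy, reference)]
        baseline = sum(p * s for p, s in zip(policy, signal))
        # Exact gradient of the objective with respect to each logit.
        grad = [p * (s - baseline) for p, s in zip(policy, signal)]
        logits = [z + lr * g for z, g in zip(logits, grad)]
    return softmax(logits)

rewards = [0.0, 1.0, 2.0]        # hypothetical reward-model scores
reference = [1 / 3, 1 / 3, 1 / 3]  # hypothetical SFT policy
for beta in (0.1, 1.0):
    policy = train(rewards, reference, beta)
    print(beta, [round(p, 3) for p in policy], round(kl(policy, reference), 3))
```

With a small `beta` the policy piles almost all probability on the highest-reward response, while a larger `beta` keeps it closer to the reference. This is the trade-off the KL penalty controls.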
