Training Language Models to Follow Instructions with Human Feedback
Ouyang, Wu, Jiang, Almeida, Wainwright, Mishkin, Zhang, Agarwal, Slama, Ray, Schulman, Hilton, Kelton, Miller, Simens, Askell, Welinder, Christiano, Leike, Lowe · NeurIPS 2022
What This Paper Did
GPT-3 is powerful but misaligned. It is trained to predict the next token of internet text, not to do what users actually want. Ask it for help and it may ignore the instruction, ramble, hallucinate, or comply with a harmful request.
This paper tackles that gap with a three-stage pipeline:
- Supervised Fine-Tuning (SFT): Train the model on human-written demonstration answers
- Reward Model (RM): Train a classifier to predict which outputs humans prefer
- Reinforcement Learning (RL): Use the reward model to optimize the policy with PPO (Proximal Policy Optimization)
The result is InstructGPT: its 1.3 billion parameter version produces outputs that human raters prefer over those of the 175 billion parameter GPT-3, despite having roughly 130× fewer parameters.
This pipeline is the direct precursor of ChatGPT and shaped how later assistants such as Claude are aligned.
Key results:
| Model | Rating |
|---|---|
| GPT-3 (175B) | 1.54 (baseline) |
| InstructGPT (1.3B) | 4.53 (best) |
| InstructGPT vs GPT-3 | Preferred 72% of the time |

Metric: 1–5 scale from human raters
The small model wins on human ratings because, for following instructions, alignment training matters more than parameter count.
Key concepts:
- Alignment: Making models do what humans want, not just what’s probable
- RLHF: Reinforcement Learning from Human Feedback
- Reward Model: Learns human preferences from comparison data
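- PPO: A policy-gradient RL algorithm that limits how far each update can move the policy, which keeps training stable
- KL Divergence Penalty: Keeps the RL model close to the SFT model (limits over-optimization of the imperfect reward model and forgetting of the SFT behavior)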
- HHH: Helpful, Harmless, Honest — the alignment criteria
- Three-stage pipeline: SFT → RM → PPO
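To make the PPO entry above concrete, here is a minimal sketch of PPO's clipped surrogate objective for a single sampled response. The function name `ppo_clipped_objective`, the clip range `eps=0.2`, and the toy numbers are illustrative assumptions, not values taken from this paper.

```python
def ppo_clipped_objective(ratio, advantage, eps=0.2):
    """Clipped surrogate for one sample (a quantity to maximize).

    ratio: pi_new(y|x) / pi_old(y|x) for the sampled response.
    advantage: how much better than expected the response scored.
    eps: clip range (0.2 is an assumed, commonly used value).
    """
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    # Taking the min removes the incentive to push the ratio past the clip range.
    return min(ratio * advantage, clipped_ratio * advantage)

inside = ppo_clipped_objective(ratio=1.1, advantage=1.0)      # within range: unclipped
capped = ppo_clipped_objective(ratio=1.5, advantage=1.0)      # gain capped at 1 + eps
penalized = ppo_clipped_objective(ratio=0.5, advantage=-1.0)  # bounded at (1 - eps) * advantage
print(inside, capped, penalized)
```

Once the ratio leaves the range `[1 - eps, 1 + eps]` in the direction that would improve the objective, the value stops changing, so a single update cannot move the policy arbitrarily far.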
What This Paper Did (Technical Overview)
The three-stage RLHF pipeline:
Stage 1: Supervised Fine-Tuning (SFT)
- Collect human demonstrations: prompts + ideal answers
- Fine-tune GPT-3 on these examples
- Result: SFT model that follows instructions better than base GPT-3
Stage 2: Reward Model (RM)
- Collect human preference comparisons: prompt + output A + output B → human rates which is better
- Train a classifier to predict human preferences
- Bradley-Terry model: models the probability that output A is better than output B
- Result: reward model that mimics human judgment
Stage 3: Reinforcement Learning (RL)
- Use PPO to optimize the SFT model against the reward model
- KL divergence penalty prevents diverging too far from SFT
- Objective: maximize reward while staying close to SFT baseline
- Result: InstructGPT model aligned with human preferences
The Indian Analogy: Three Teachers
Imagine training a brilliant but undisciplined student for an exam:
Stage 1: Learning from a master teacher (SFT)
- A master teacher (human expert) shows the student how to solve problems correctly
- The student learns by imitation: “When you see this type of question, do this”
- Student becomes much better but hasn’t internalized why this is the right way
Stage 2: Learning from a coach (Reward Model)
- Now bring in a coach (not the master teacher, but an assistant)
- The coach watches two attempts at the same problem
- The coach judges: “Attempt A is better because it’s clearer and more efficient”
- The coach rates dozens of pairs of attempts
- The student learns the coach’s judgment pattern
Stage 3: Practicing with feedback (RL)
- The student practices solving problems on their own
- After each attempt, the coach scores it (that’s the reward)
- The student adjusts their strategy to get higher scores from the coach
- But here’s the catch: the coach is imperfect (learned from limited examples)
- So the student doesn’t abandon the original master teacher’s lessons entirely
- The student balances: follow the coach’s guidance, but stay somewhat true to the master’s principles
This balance (via KL divergence penalty) prevents two failures:
- The student forgets the master’s teachings and only optimizes for the coach’s scores
- The student ignores the coach and doesn’t improve
The Math Overview
Stage 1 (SFT): Standard supervised learning
Minimize cross-entropy loss: $$L_{\text{SFT}} = -\log \pi_{\text{SFT}}(y | x)$$
Where $\pi_{\text{SFT}}$ is the SFT model’s probability of generating correct response $y$ given prompt $x$.
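In practice a response is a sequence of tokens, so $\log \pi_{\text{SFT}}(y | x)$ is a sum of per-token log-probabilities. A minimal sketch, assuming the model already gives us the probability it assigned to each correct token; `sft_loss` and the toy probabilities are hypothetical:

```python
import math

def sft_loss(token_probs):
    """Negative log-likelihood of one demonstration response.

    token_probs: probability the model assigned to each correct response
    token, given the prompt and the preceding tokens (assumed inputs).
    """
    return -sum(math.log(p) for p in token_probs)

# Toy example: a 3-token response the model finds fairly likely.
loss = sft_loss([0.9, 0.8, 0.5])
# Same value as -log of the whole-sequence probability 0.9 * 0.8 * 0.5.
print(loss, -math.log(0.9 * 0.8 * 0.5))
```

The loss is zero only when every correct token gets probability 1, and it grows as the model assigns less probability to the demonstration.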
Stage 2 (Reward Model): Bradley-Terry preference model
Loss for a single preference comparison (A is better than B): $$L_{\text{RM}} = -\log \sigma(r_\theta(x, y_A) - r_\theta(x, y_B))$$
Where:
- $r_\theta(x, y)$ is the reward for (prompt, response) pair
- $\sigma$ is the sigmoid function
- We want $r_\theta(x, y_A) > r_\theta(x, y_B)$ when A is preferred
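The reward model loss for one comparison can be sketched as follows, assuming the two scalar scores have already been computed. The names `rm_pair_loss` and `rm_pair_grads` are hypothetical, and the scores are toy values:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def rm_pair_loss(r_preferred, r_rejected):
    """-log sigmoid(r_A - r_B) for one comparison where A was preferred."""
    return -math.log(sigmoid(r_preferred - r_rejected))

def rm_pair_grads(r_preferred, r_rejected):
    """Gradient of the loss with respect to (r_preferred, r_rejected)."""
    p_correct = sigmoid(r_preferred - r_rejected)
    # Gradient descent raises the preferred score and lowers the rejected one.
    return -(1.0 - p_correct), (1.0 - p_correct)

undecided = rm_pair_loss(0.0, 0.0)  # equal scores: loss is log(2)
correct = rm_pair_loss(2.0, 0.0)    # preferred response scored higher
wrong = rm_pair_loss(0.0, 2.0)      # preferred response scored lower
print(undecided, correct, wrong)
```

The gradient shrinks toward zero as the model becomes confident in the correct ordering, so comparisons it already gets right contribute little to training.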
Stage 3 (RL): PPO with KL penalty
$$L_{\text{RL}} = -\mathbb{E}_{y \sim \pi_{\text{RL}}}[r_\theta(x, y)] + \beta \cdot \text{KL}[\pi_{\text{RL}}(y|x) \,\|\, \pi_{\text{SFT}}(y|x)]$$
Where:
- $r_\theta$ is the reward model
- $\beta$ is the KL penalty coefficient (e.g., 0.01)
- The KL term prevents the RL policy from diverging too far from SFT
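For a response sampled from the RL policy, the difference of log-probabilities $\log \pi_{\text{RL}}(y|x) - \log \pi_{\text{SFT}}(y|x)$ is a single-sample estimate of the KL term. A minimal sketch of the resulting Monte Carlo loss, assuming the reward and both log-probabilities are given; the function names and numbers are hypothetical:

```python
def kl_penalized_reward(reward, logp_rl, logp_sft, beta=0.01):
    """Reward minus the KL penalty estimate for one sampled response."""
    return reward - beta * (logp_rl - logp_sft)

def rl_loss(samples, beta=0.01):
    """Monte Carlo estimate of L_RL.

    samples: list of (reward, logp_rl, logp_sft) triples for responses
    drawn from the RL policy (assumed to be precomputed).
    """
    total = sum(kl_penalized_reward(r, lp_rl, lp_sft, beta)
                for r, lp_rl, lp_sft in samples)
    return -total / len(samples)

samples = [
    (1.0, -2.0, -5.0),  # RL policy likes this response far more than SFT did
    (0.5, -4.0, -4.0),  # both policies agree, so there is no penalty
]
loss = rl_loss(samples, beta=0.01)
print(loss)
```

The first sample is penalized because the RL policy has moved away from the SFT policy on it; raising `beta` makes that drift more expensive.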
Key Equations at a Glance
SFT Loss:
L_sft = -log π_sft(y | x)
Reward Model Loss (Bradley-Terry):
L_rm = -log σ(r_θ(x, y_w) - r_θ(x, y_l))
where σ(z) = 1 / (1 + exp(-z)) [sigmoid], y_w is the preferred response, and y_l is the rejected one
RL Loss (PPO with KL penalty):
L_rl = -E[r_θ(x, y)] + β · KL[π_rl(y|x) || π_sft(y|x)]
KL Divergence:
KL[P || Q] = Σ P(x) log(P(x) / Q(x))
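The KL formula can be checked on small discrete distributions. A minimal sketch, with `kl_divergence` as a hypothetical helper and toy distributions:

```python
import math

def kl_divergence(p, q):
    """KL[P || Q] = sum over x of P(x) * log(P(x) / Q(x)).

    Terms with P(x) == 0 contribute nothing; Q(x) must be positive
    wherever P(x) is positive.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]
same = kl_divergence(p, p)
forward = kl_divergence(p, q)
reverse = kl_divergence(q, p)
print(same, forward, reverse)
```

The divergence is zero only when the two distributions match, and it is not symmetric: `forward` and `reverse` differ, which is why the order of the arguments in the RL objective matters.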
Read in This Order
| Section | What You Will Learn | Difficulty | Time |
|---|---|---|---|
| 01 — Context | Why alignment matters; the gap between capability and alignment | 🟢 beginner | 8 min |
| 02 — The Problem | What’s wrong with training language models on internet text | 🟡 intermediate | 7 min |
| 03 — The Idea | Three-stage RLHF pipeline; intuition behind each stage | 🟡 intermediate | 12 min |
| 04 — The Math | SFT loss, Bradley-Terry reward model, PPO objective with KL penalty | 🔴 advanced | 12 min |
| 05 — Worked Example | Step-by-step trace of reward model training on concrete preference pairs | 🟡 intermediate | 8 min |
| 06 — The Code | Implementing reward model training and PPO-style updates | 🔴 advanced | 10 min |
| 07 — Limitations | When RLHF fails; reward hacking; data requirements | 🟡 intermediate | 6 min |
| 08 — Impact | ChatGPT, Claude, modern alignment methods | 🟡 intermediate | 5 min |
| 09 — Summary | One-sentence recap, key numbers, what came next | 🟢 beginner | 3 min |
Before You Read: Math You Need
- Entropy: How uncertain a distribution is.
- KL Divergence: How different two distributions are.
- Cross-Entropy Loss: What we minimize in supervised learning.
- Sigmoid Function: Maps reals to (0, 1). Brief review in section 04.
- Paper 12 (GPT-3): Understand what we’re aligning. Paper 12: Language Models are Unsupervised Multitask Learners
- Paper 14 (Chain-of-Thought): Background on prompting from the previous paper in this series. Paper 14: Chain-of-Thought Prompting
Architecture: The RLHF Pipeline
STAGE 1: SUPERVISED FINE-TUNING (SFT)
GPT-3 (175B)
|
[Human demonstrations]
(prompt, ideal_answer) pairs
|
Fine-tune via SFT loss
|
π_SFT: SFT Model
(better at following instructions)
STAGE 2: REWARD MODEL TRAINING
[Human comparisons]
(prompt, response_A, response_B) + label A or B
|
Train classifier: is A better than B?
|
r_θ: Reward Model
(predicts human preference)
STAGE 3: REINFORCEMENT LEARNING (PPO)
π_SFT → π_RL (via policy gradient)
For each rollout:
1. Generate response from π_RL
2. Get reward from r_θ
3. Compute RL gradient: maximize reward - KL penalty
4. Update π_RL via PPO
Result: InstructGPT
(aligned with human preferences, smaller than GPT-3 but better)
ALIGNMENT EFFECT:
GPT-3 (175B): 1.54 rating
InstructGPT (1.3B): 4.53 rating ← 130× smaller, 3× better rated!
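The Stage 3 trade-off in the diagram above can be illustrated on a toy problem. The sketch below is not PPO: it assumes a policy over just three canned responses with fixed reward-model scores, and runs exact gradient ascent on expected reward minus the KL penalty. All names and numbers are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def optimize_policy(sft_logits, rewards, beta, steps=2000, lr=0.5):
    """Gradient ascent on E[reward] - beta * KL(pi_RL || pi_SFT)."""
    pi_sft = softmax(sft_logits)
    logits = list(sft_logits)  # pi_RL starts as a copy of pi_SFT
    for _ in range(steps):
        pi_rl = softmax(logits)
        # KL-penalized score of each response under the current policy.
        score = [r - beta * (math.log(p) - math.log(q))
                 for r, p, q in zip(rewards, pi_rl, pi_sft)]
        baseline = sum(p * s for p, s in zip(pi_rl, score))
        # Exact gradient of the objective with respect to each logit.
        logits = [v + lr * p * (s - baseline)
                  for v, p, s in zip(logits, pi_rl, score)]
    return softmax(logits)

rewards = [0.0, 1.0, 2.0]     # toy reward-model scores for three responses
sft_logits = [0.0, 0.0, 0.0]  # toy SFT policy: uniform over the responses
loose = optimize_policy(sft_logits, rewards, beta=0.2)
tight = optimize_policy(sft_logits, rewards, beta=1.0)
print(loose, tight)
```

With a uniform SFT policy the optimum is proportional to exp(reward / beta), so the larger `beta` leaves the policy closer to π_SFT while the smaller one concentrates almost all probability on the highest-scoring response.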