Training Language Models to Follow Instructions with Human Feedback

Ouyang, Wu, Jiang, Almeida, Wainwright, Mishkin, Zhang, Agarwal, Slama, Ray, Schulman, Hilton, Kelton, Miller, Simens, Askell, Welinder, Christiano, Leike, Lowe · NeurIPS 2022
Read on arXiv

What This Paper Did

GPT-3 is powerful but misaligned. Its training objective is to predict the next token of internet text, not to do what users actually want. Ask it to help with something, and it might ignore the instruction, ramble, hallucinate, or comply with a harmful request.

This paper tackles the alignment problem with a three-stage pipeline:

Supervised Fine-Tuning (SFT): Train the model on human-written demonstration answers
Reward Model (RM): Train a classifier to predict which outputs humans prefer
Reinforcement Learning (RL): Use the reward model to optimize the policy with PPO (Proximal Policy Optimization)

The result is InstructGPT. Its 1.3 billion parameter version produces outputs that human raters prefer over those of the 175 billion parameter GPT-3, despite having over 100× fewer parameters.

The same recipe was the basis for ChatGPT and influenced the training of Claude and most later instruction-following LLMs.

Key results:

Model                    | Human rating (1–5 scale)
-------------------------|--------------------------
GPT-3 (175B)             | 1.54 (baseline)
InstructGPT (1.3B)       | 4.53 (best)

Head-to-head, InstructGPT outputs were preferred over GPT-3 outputs 72% of the time.

The small model beats the large one on human preference, which suggests that alignment training can matter more than raw scale for producing outputs people actually want.

Key concepts:

Alignment: Making models do what humans want, not just what’s probable
RLHF: Reinforcement Learning from Human Feedback
Reward Model: Learns human preferences from comparison data
PPO: A policy-gradient RL algorithm that limits how far each update can move the policy, which keeps training stable
KL Divergence Penalty: Keeps the RL model close to the SFT model, so it cannot drift into degenerate outputs that merely exploit the reward model
HHH: Helpful, Harmless, Honest (the alignment criteria)
Three-stage pipeline: SFT → RM → PPO

What This Paper Did (Technical Overview)

The three-stage RLHF pipeline:

Stage 1: Supervised Fine-Tuning (SFT)

Collect human demonstrations: prompts + ideal answers
Fine-tune GPT-3 on these examples
Result: SFT model that follows instructions better than base GPT-3

Stage 2: Reward Model (RM)

Collect human preference comparisons: prompt + output A + output B → human rates which is better
Train a classifier to predict human preferences
Bradley-Terry model: models the probability that output A is better than output B
Result: reward model that mimics human judgment

Stage 3: Reinforcement Learning (RL)

Use PPO to optimize the SFT model against the reward model
KL divergence penalty prevents diverging too far from SFT
Objective: maximize reward while staying close to SFT baseline
Result: InstructGPT model aligned with human preferences
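
Stage 3 names PPO without spelling out its objective. As a rough illustration, the sketch below computes the clipped surrogate from the original PPO formulation for a single sampled response. The function name, the `advantage` input, and `epsilon=0.2` are illustrative assumptions, not details taken from this paper.

```python
import math

def ppo_clipped_objective(logp_new, logp_old, advantage, epsilon=0.2):
    """Clipped PPO surrogate for one sample (a quantity to maximize).

    logp_new / logp_old: log-probability of the sampled response under the
    current policy and under the policy that generated the sample.
    advantage: how much better than expected the response scored.
    epsilon: clipping range (hypothetical default).
    """
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - epsilon, min(1.0 + epsilon, ratio))
    # Taking the minimum removes any incentive to push the ratio
    # outside [1 - epsilon, 1 + epsilon] in the direction that helps.
    return min(ratio * advantage, clipped_ratio * advantage)

# Policy unchanged: the objective equals the advantage.
print(ppo_clipped_objective(-3.0, -3.0, advantage=2.0))

# Policy doubled the probability of a good response: the gain is capped.
print(ppo_clipped_objective(math.log(2.0), 0.0, advantage=1.0))
```

The clipping is what makes each update conservative: once the new policy has moved far enough from the old one, further movement in the same direction earns no additional objective.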

The Indian Analogy: Three Teachers

Imagine training a brilliant but undisciplined student for an exam:

Stage 1: Learning from a master teacher (SFT)

A master teacher (human expert) shows the student how to solve problems correctly
The student learns by imitation: “When you see this type of question, do this”
Student becomes much better but hasn’t internalized why this is the right way

Stage 2: Learning from a coach (Reward Model)

Now bring in a coach (not the master teacher, but an assistant)
The coach watches two attempts at the same problem
The coach judges: “Attempt A is better because it’s clearer and more efficient”
The coach rates dozens of pairs of attempts
These judgments form a consistent scoring pattern that can be applied to new attempts (this is the reward model)

Stage 3: Practicing with feedback (RL)

The student practices solving problems on their own
After each attempt, the coach scores it (that’s the reward)
The student adjusts their strategy to get higher scores from the coach
But here’s the catch: the coach is imperfect (learned from limited examples)
So the student doesn’t abandon the original master teacher’s lessons entirely
The student balances: follow the coach’s guidance, but stay somewhat true to the master’s principles

This balance (via KL divergence penalty) prevents two failures:

The student forgets the master’s teachings and only optimizes for the coach’s scores
The student ignores the coach and doesn’t improve

The Math Overview

Stage 1 (SFT): Standard supervised learning

Minimize cross-entropy loss: $$L_{\text{SFT}} = -\log \pi_{\text{SFT}}(y | x)$$

Where $\pi_{\text{SFT}}$ is the SFT model’s probability of generating correct response $y$ given prompt $x$.
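
For a response with several tokens, the probability of the whole response is the product of per-token probabilities, so the loss becomes a sum of per-token negative log-probabilities. A minimal sketch in plain Python, assuming hypothetical per-token probabilities in place of a real model:

```python
import math

def sft_loss(token_probs):
    """Negative log-likelihood of one demonstration response.

    token_probs[t] is the (hypothetical) probability the model assigns to
    the t-th token of the response y, given the prompt x and earlier tokens.
    """
    return -sum(math.log(p) for p in token_probs)

# Illustrative numbers only, not taken from the paper.
confident = [0.9, 0.8, 0.95]  # model already agrees with the demonstration
unsure = [0.2, 0.1, 0.3]      # model finds the demonstration unlikely

print(round(sft_loss(confident), 3))
print(round(sft_loss(unsure), 3))
```

The second loss is much larger than the first, so gradient descent pushes hardest on demonstrations the model currently considers unlikely.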

Stage 2 (Reward Model): Bradley-Terry preference model

Loss for a single preference comparison (A is better than B): $$L_{\text{RM}} = -\log \sigma(r_\theta(x, y_A) - r_\theta(x, y_B))$$

Where:

$r_\theta(x, y)$ is the reward for (prompt, response) pair
$\sigma$ is the sigmoid function
We want $r_\theta(x, y_A) > r_\theta(x, y_B)$ when A is preferred
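
To see how this loss behaves, the sketch below evaluates it for a few hypothetical reward values. In practice the rewards would come from a neural network; here they are plain numbers chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def rm_loss(reward_preferred, reward_rejected):
    """Bradley-Terry loss for one comparison where the first response won."""
    return -math.log(sigmoid(reward_preferred - reward_rejected))

# Hypothetical scalar rewards for two responses to the same prompt.
agrees = rm_loss(2.0, 0.0)     # reward model ranks the pair like the human did
undecided = rm_loss(0.0, 0.0)  # equal rewards: loss is log(2)
disagrees = rm_loss(0.0, 2.0)  # reward model ranks the pair the wrong way

print(agrees, undecided, disagrees)
```

Only the difference between the two rewards matters, so the loss fixes the ordering and spacing of rewards but not their absolute level.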

Stage 3 (RL): PPO with KL penalty

$$L_{\text{RL}} = -E[r_\theta(x, y)] + \beta \cdot KL[\pi_{\text{RL}}(y|x) || \pi_{\text{SFT}}(y|x)]$$

Where:

$r_\theta$ is the reward model
$\beta$ is the KL penalty coefficient (e.g., 0.01)
The KL term prevents the RL policy from diverging too far from SFT
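
One common way to estimate this objective is from sampled responses, using $\log \pi_{\text{RL}}(y|x) - \log \pi_{\text{SFT}}(y|x)$ as a single-sample estimate of the KL term. The sketch below follows that assumption; the function names and all numbers are hypothetical.

```python
def penalized_reward(reward, logp_rl, logp_sft, beta=0.01):
    """Reward minus the KL penalty, estimated from one sampled response."""
    kl_estimate = logp_rl - logp_sft
    return reward - beta * kl_estimate

def rl_loss(samples, beta=0.01):
    """Monte Carlo estimate of L_RL over (reward, logp_rl, logp_sft) tuples."""
    total = sum(
        penalized_reward(reward, logp_rl, logp_sft, beta)
        for reward, logp_rl, logp_sft in samples
    )
    return -total / len(samples)

# Same reward, but the second policy has drifted: it assigns the response a
# much higher log-probability than the SFT model does, so it pays a penalty.
on_distribution = [(1.0, -5.0, -5.0)]
drifted = [(1.0, -2.0, -5.0)]

print(rl_loss(on_distribution, beta=0.1))
print(rl_loss(drifted, beta=0.1))
```

With equal rewards, the drifted policy gets the higher (worse) loss, which is the pressure that keeps the RL policy near the SFT model.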

Key Equations at a Glance

SFT Loss:
  L_sft = -log π_sft(y | x)

Reward Model Loss (Bradley-Terry):
  L_rm = -log σ(r_θ(x, y_w) - r_θ(x, y_l))
  where σ(z) = 1 / (1 + exp(-z)) [sigmoid]

RL Loss (PPO with KL penalty):
  L_rl = -E[r_θ(x, y)] + β · KL[π_rl(y|x) || π_sft(y|x)]

KL Divergence:
  KL[P || Q] = Σ P(x) log(P(x) / Q(x))
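
The KL formula is easy to check numerically. A small sketch for discrete distributions given as lists of probabilities (the example distributions are made up):

```python
import math

def kl_divergence(p, q):
    """KL[P || Q] for two discrete distributions of equal length.

    Terms with P(x) = 0 contribute nothing and are skipped.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]

print(kl_divergence(p, p))  # identical distributions give zero
print(kl_divergence(p, q))  # positive when the distributions differ
print(kl_divergence(q, p))  # a different value: KL is not symmetric
```

The asymmetry matters for RLHF: the penalty is KL[π_rl || π_sft], so it is measured on responses the RL policy actually generates.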

Read in This Order

Section | What You Will Learn | Difficulty | Time
--------|---------------------|------------|------
01 - Context | Why alignment matters; the gap between capability and alignment | 🟢 beginner | 8 min
02 - The Problem | What’s wrong with training language models on internet text | 🟡 intermediate | 7 min
03 - The Idea | Three-stage RLHF pipeline; intuition behind each stage | 🟡 intermediate | 12 min
04 - The Math | SFT loss, Bradley-Terry reward model, PPO objective with KL penalty | 🔴 advanced | 12 min
05 - Worked Example | Step-by-step trace of reward model training on concrete preference pairs | 🟡 intermediate | 8 min
06 - The Code | Implementing reward model training and PPO-style updates | 🔴 advanced | 10 min
07 - Limitations | When RLHF fails; reward hacking; data requirements | 🟡 intermediate | 6 min
08 - Impact | ChatGPT, Claude, modern alignment methods | 🟡 intermediate | 5 min
09 - Summary | One-sentence recap, key numbers, what came next | 🟢 beginner | 3 min

Before You Read: Math You Need

Entropy: How uncertain a distribution is.
KL Divergence: How different two distributions are.
Cross-Entropy Loss: What we minimize in supervised learning.
Sigmoid Function: Maps real numbers to the open interval (0, 1). Brief review in section 04.
Paper 12 (GPT-3, Language Models are Few-Shot Learners): Understand what we’re aligning.
Paper 14 (Chain-of-Thought Prompting): Optional background on steering a model through prompting alone, in contrast to the training-based approach here.

Architecture: The RLHF Pipeline

STAGE 1: SUPERVISED FINE-TUNING (SFT)
    GPT-3 (175B)
         |
    [Human demonstrations]
    (prompt, ideal_answer) pairs
         |
    Fine-tune via SFT loss
         |
    π_SFT: SFT Model
    (better at following instructions)


STAGE 2: REWARD MODEL TRAINING
    [Human comparisons]
    (prompt, response_A, response_B) + label A or B
         |
    Train classifier: is A better than B?
         |
    r_θ: Reward Model
    (predicts human preference)


STAGE 3: REINFORCEMENT LEARNING (PPO)
    π_SFT → π_RL (via policy gradient)
    
    For each rollout:
      1. Generate response from π_RL
      2. Get reward from r_θ
      3. Compute RL gradient: maximize reward - KL penalty
      4. Update π_RL via PPO

    Result: InstructGPT
    (aligned with human preferences, smaller than GPT-3 but better)


ALIGNMENT EFFECT:
    GPT-3 (175B):           1.54 rating
    InstructGPT (1.3B):     4.53 rating  ← over 100× fewer parameters, about 3× higher rating
