Section 09

Summary: RLHF and the Birth of ChatGPT

Training Language Models to Follow Instructions with Human Feedback 2022

One-Sentence Version

Reinforcement Learning from Human Feedback (RLHF) uses a three-stage pipeline (supervised fine-tuning, reward model training, and policy-gradient optimization) to align language models with human preferences, so that labelers prefer outputs from a 1.3B InstructGPT to those of the 175B GPT-3, a model more than 100× larger.


The Problem

Large language models like GPT-3 are capable but misaligned. They are trained to predict the next token of internet text, which mixes helpful, harmful, honest, and dishonest content, and that objective does not distinguish between them. Predicting likely text is not the same as doing what the user wants.


The Idea

Three-stage pipeline:

  1. SFT (Supervised Fine-Tuning): Fine-tune GPT-3 on human-written demonstrations
  2. Reward Model (RM): Train a model that assigns a scalar score to each (prompt, response) pair, fitting it to human comparisons between responses with a Bradley-Terry loss
  3. RL with PPO: Optimize the policy to maximize reward while staying close to the SFT model (KL penalty)

The Math

SFT Loss: $$L_{\text{SFT}} = -E[\log \pi_{\text{SFT}}(y|x)]$$
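
The SFT loss above is ordinary negative log-likelihood on the demonstrations. A minimal sketch, assuming the per-token log-probabilities have already been computed by the model; the function name `sft_loss`, the per-token averaging, and the toy values are illustrative assumptions, not details from the paper:

```python
import math

def sft_loss(token_log_probs):
    """Average negative log-likelihood over demonstration tokens.

    token_log_probs: one list per demonstration, holding
    log pi(y_t | x, y_<t) for each target token (hypothetical inputs).
    """
    total = sum(sum(seq) for seq in token_log_probs)
    n_tokens = sum(len(seq) for seq in token_log_probs)
    return -total / n_tokens

# Toy values (made up): the model gives probability 0.5 to every target token.
demo_log_probs = [[math.log(0.5)] * 3, [math.log(0.5)] * 2]
loss = sft_loss(demo_log_probs)
print(round(loss, 4))
```

With every target token at probability 0.5, the loss equals log 2; it falls toward 0 as the model puts more probability on the demonstrated tokens.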

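Reward Model Loss (Bradley-Terry): $$L_{\text{RM}} = -E[\log \sigma(r_\theta(x, y_w) - r_\theta(x, y_l))]$$

Here y_w is the response the labeler preferred for prompt x, y_l is the one they ranked lower, and σ is the logistic sigmoid.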
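
The reward model loss can be sketched on plain scalar scores. This is a minimal illustration, assuming the scores r(x, y_w) and r(x, y_l) are already available as floats; the helper names and the toy rewards are hypothetical:

```python
import math

def log_sigmoid(z):
    # Numerically stable log(sigmoid(z)).
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def bradley_terry_loss(chosen_rewards, rejected_rewards):
    """Mean of -log sigmoid(r(x, y_w) - r(x, y_l)) over comparison pairs."""
    pairs = list(zip(chosen_rewards, rejected_rewards))
    return -sum(log_sigmoid(rw - rl) for rw, rl in pairs) / len(pairs)

# Toy scalar rewards (made up).
tied = bradley_terry_loss([0.0], [0.0])      # RM cannot tell the pair apart
correct = bradley_terry_loss([2.0], [-1.0])  # RM scores the preferred response higher
wrong = bradley_terry_loss([-1.0], [2.0])    # RM scores the preferred response lower
```

The loss depends only on the difference between the two scores, so the reward model's absolute scale is arbitrary. A tie costs log 2, a correct ranking costs less, and an inverted ranking costs more.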

RL Objective (PPO with KL): $$L_{\text{RL}} = -E[r_\theta(x, y)] + \beta \cdot KL[\pi_{\text{RL}} || \pi_{\text{SFT}}]$$

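The expectation is over prompts x and responses y sampled from π_RL, and r_θ is the frozen reward model from stage 2. The KL term keeps the RL policy close to the SFT baseline, which limits how far the policy can exploit errors in the reward model.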
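
To make the trade-off concrete, here is a rough Monte Carlo sketch of the RL objective. It assumes responses have been sampled from the RL policy and that sequence log-probabilities under both policies are available; the function name `rl_loss`, its arguments, and the toy numbers are illustrative assumptions, and a real PPO implementation does considerably more than this:

```python
def rl_loss(rewards, logp_rl, logp_sft, beta=0.02):
    """Estimate -E[r] + beta * KL[pi_RL || pi_SFT] from samples y ~ pi_RL.

    rewards[i]  : reward model score for sampled response i
    logp_rl[i]  : log pi_RL(y_i | x)
    logp_sft[i] : log pi_SFT(y_i | x)
    """
    n = len(rewards)
    mean_reward = sum(rewards) / n
    # Sample estimate of the KL: mean of log pi_RL(y|x) - log pi_SFT(y|x).
    kl_estimate = sum(a - b for a, b in zip(logp_rl, logp_sft)) / n
    return -mean_reward + beta * kl_estimate

# Toy numbers (made up): same rewards, with and without drift from the SFT model.
no_drift = rl_loss([1.0, 2.0], [-3.0, -4.0], [-3.0, -4.0])
drifted = rl_loss([1.0, 2.0], [-3.0, -4.0], [-8.0, -9.0])
```

With identical rewards, the drifted policy gets a worse (higher) loss because its samples are far less likely under the SFT model. A larger β makes that penalty steeper.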


Key Results

Model        Size   Rating  Human Preference
GPT-3        175B   1.54    Baseline
InstructGPT  1.3B   4.53    72% preferred

Headline: A 1.3B model beats a 175B model when properly aligned.


The Indian Analogy

Three stages of learning:

  1. Learning from a master teacher: Watch how the master solves problems (SFT)
  2. Learning from a coach: A coach rates your attempts and judges which are better (RM)
  3. Self-improvement with feedback: Practice and adjust based on the coach’s ratings, but don’t forget the master’s principles (RL with KL)

The balance between following the coach and remembering the master is crucial.


Key Numbers

  • About 13,000 training prompts with human-written demonstrations for SFT
  • About 33,000 training prompts with labeler-ranked responses for RM training
  • 1.3B parameters for InstructGPT (tiny compared to GPT-3)
  • 72% of humans preferred InstructGPT over GPT-3
  • 4.53 rating vs. 1.54 (nearly 3× improvement)
  • β ≈ 0.02 for KL coefficient (sweet spot in practice)
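
The two ratios quoted in this list can be checked with a few lines of arithmetic. The values are the ones listed above; only the variable names are new:

```python
gpt3_params = 175e9
instructgpt_params = 1.3e9
size_ratio = gpt3_params / instructgpt_params    # how many times larger GPT-3 is

gpt3_rating = 1.54
instructgpt_rating = 4.53
rating_ratio = instructgpt_rating / gpt3_rating  # improvement factor in rating

print(f"size ratio: {size_ratio:.1f}x, rating ratio: {rating_ratio:.2f}x")
```

This prints a size ratio of about 134.6× and a rating ratio of about 2.94×, consistent with "nearly 3×".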

What Came Before: Context

  • Paper 12 (GPT-3): The base model, powerful but misaligned
  • Paper 14 (Chain-of-Thought): Reasoning emerges at scale; influences reward model design
  • Previous alignment work: Learning from preferences (Christiano et al., 2017)

What Came Next

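  1. ChatGPT (Nov 2022): A sibling model to InstructGPT, trained with the same RLHF recipe for dialogue and released to the public
  2. Constitutional AI (2022): Uses LLM feedback instead of human feedback for harmlessness
  3. DPO (2023): Removes the separate reward model and the RL loop by optimizing the policy directly on preference pairs
  4. ORPO (2024): Folds preference optimization into the fine-tuning loss, with no reference model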
  5. Reasoning models (2024–2025): o1, R1 use test-time compute for reasoning

The basic insight (learn from preferences, optimize with RL) is here to stay. Implementation details improve yearly.


Limitations

  • ✗ Reward hacking (gaming the reward model)
  • ✗ Human rater inconsistency (only 73% agreement)
  • ✗ Distributional shift (RM unreliable out-of-distribution)
  • ✗ Unfaithful reasoning (explanations don’t match computation)
  • ✗ Data requirements (expensive to scale)
  • ✗ KL tuning (β is hard to choose)
  • ✗ Alignment tax (RLHF can regress performance on standard NLP benchmarks unless pretraining data is mixed into the RL updates)

Why This Paper Matters

Before: “Alignment requires new architectures, symbolic AI, or different training objectives.”

After: “Alignment is learnable. Use preference learning + RL + KL penalty.”

This paper:

  1. Made alignment practical at scale
  2. Enabled ChatGPT (the product that changed everything)
  3. Showed that smaller + aligned > larger + unaligned
  4. Created a template that most modern LLM post-training pipelines still follow

In 2025: RLHF and its variants are part of the training of nearly every widely deployed chat LLM. This paper is foundational.


Glossary

RLHF (Reinforcement Learning from Human Feedback): Training language models using RL with human preferences as the reward signal.

Bradley-Terry Model: A pairwise comparison model in which the probability that one output is preferred over another is the sigmoid of the difference between their scalar scores.

Reward Model (RM): A neural network trained to predict human preferences on (prompt, response) pairs.

KL Divergence Penalty: A regularization term that keeps the RL policy close to the SFT policy, preventing excessive divergence.

PPO (Proximal Policy Optimization): A policy-gradient RL algorithm that stabilizes training by clipping the probability ratio between the new and old policy, which limits the size of each policy update.

Alignment: Making models do what humans want, not just what’s probable in internet text.

Supervised Fine-Tuning (SFT): Fine-tuning a pretrained model on human-labeled examples using standard cross-entropy loss.

Instruction-Following: The capability to follow user instructions accurately and reliably.

Human Preference: What humans actually want, captured via comparisons or ratings.

Policy: The language model’s probability distribution over responses given a prompt.
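
The PPO entry above can be illustrated with the standard clipped surrogate objective. This is a generic sketch of that objective, not the paper's training code; the clip range of 0.2 is an assumed common default, and the log-probabilities and advantages are made-up toy values:

```python
import math

def ppo_clipped_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean of min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A).

    ratio = pi_new(a|s) / pi_old(a|s), computed from log-probabilities.
    The objective is maximized; clipping removes the incentive to move
    the ratio outside [1 - eps, 1 + eps].
    """
    terms = []
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        terms.append(min(ratio * adv, clipped * adv))
    return sum(terms) / len(terms)

# Ratio 1.5 with a positive advantage: the gain is capped at 1.2 * A.
capped_gain = ppo_clipped_objective([math.log(0.3)], [math.log(0.2)], [1.0])
# Ratio 0.5 with a negative advantage: the term is held at 0.8 * A.
capped_loss = ppo_clipped_objective([math.log(0.1)], [math.log(0.2)], [-1.0])
```

Note that the clip acts on the probability ratio inside the objective, which is different from clipping gradient norms.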


Further Reading

Original Paper: Training Language Models to Follow Instructions with Human Feedback — Ouyang et al., NeurIPS 2022

Key Follow-Ups:

  • Constitutional AI (Bai et al., 2022) — LLM feedback instead of human
  • DPO (Rafailov et al., 2023) — Direct preference optimization
  • ORPO (Hong et al., 2024) — Simpler, more stable

Blog Resources:

  • OpenAI’s blog on InstructGPT and ChatGPT
  • Anthropic’s blog on Constitutional AI
  • HuggingFace’s TRL library (RLHF implementation)


Continue the alignment journey:

  1. Constitutional AI: Harmlessness from AI Feedback — RLAIF, the next iteration
  2. Direct Preference Optimization: Your Language Model is Secretly a Reward Model — Simpler than RLHF
  3. Paper 16 (if available): Self-verification and scaling supervision
  4. DeepSeek R1 / OpenAI o1 papers — Reasoning-focused alignment


Key Takeaway

Alignment is learnable. With preference data and the right objective, we can train models to be helpful, harmless, and honest. This insight — demonstrated by this paper — changed AI forever.

Every time you use ChatGPT, Claude, or another aligned LLM, you’re using a descendant of the technique from this paper.

