Summary: RLHF and the Birth of ChatGPT
One-Sentence Version
Reinforcement Learning from Human Feedback (RLHF) uses a three-stage pipeline (supervised fine-tuning, reward model training, and policy gradient optimization) to align language models with human preferences. The result: human labelers prefer outputs from a 1.3B InstructGPT over those from the 175B GPT-3, even though it has about 135× fewer parameters.
The Problem
Large language models like GPT-3 are capable but misaligned. They are trained to predict the next token of internet text, which mixes helpful, harmful, honest, and dishonest content with no signal telling them apart. Nothing in that objective tells the model what a user actually wants.
The Idea
Three-stage pipeline:
- SFT (Supervised Fine-Tuning): Fine-tune GPT-3 on human-written demonstrations
- Reward Model (RM): Train a model that assigns a scalar score to each (prompt, response) pair, fit to human preference comparisons with a Bradley-Terry loss
- RL with PPO: Optimize the policy to maximize reward while staying close to the SFT model (KL penalty)
The Math
SFT Loss: $$L_{\text{SFT}} = -E[\log \pi_{\text{SFT}}(y|x)]$$
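As a rough illustration of the SFT loss, the sketch below averages the negative log-likelihood of demonstration responses. It assumes the per-token probabilities have already been read off the model; the function name `sft_loss` and the toy numbers are hypothetical, not from the paper.

```python
import math

def sft_loss(token_probs_per_example):
    """Average negative log-likelihood of human-written responses.

    token_probs_per_example: list of lists. Each inner list holds the
    probability the model assigned to each token of one demonstration.
    (Hypothetical input format, chosen for illustration.)
    """
    total = 0.0
    for token_probs in token_probs_per_example:
        # log pi(y|x) factorizes into a sum of per-token log-probabilities
        total += -sum(math.log(p) for p in token_probs)
    return total / len(token_probs_per_example)

# A model that assigns high probability to the demonstration has a lower loss.
confident = sft_loss([[0.9, 0.8, 0.95]])
uncertain = sft_loss([[0.3, 0.2, 0.5]])
print(confident < uncertain)
```

In practice this is the standard cross-entropy loss computed by a deep learning framework over the response tokens.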
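Reward Model Loss (Bradley-Terry): $$L_{\text{RM}} = -E[\log \sigma(r_\theta(x, y_w) - r_\theta(x, y_l))]$$
Here $y_w$ is the response the labeler preferred, $y_l$ is the one they rejected, and $\sigma$ is the logistic sigmoid.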
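A minimal sketch of the Bradley-Terry loss for a single comparison, assuming the two scalar rewards have already been computed. The function name is illustrative, and the loss is written in a numerically stable form rather than calling a sigmoid directly.

```python
import math

def bradley_terry_loss(reward_chosen, reward_rejected):
    """-log sigmoid(r(x, y_w) - r(x, y_l)) for one preference pair."""
    margin = reward_chosen - reward_rejected
    # -log sigmoid(m) = log(1 + exp(-m)); branch to avoid overflow in exp
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Equal rewards: the model is indifferent, so the loss is log(2).
print(bradley_terry_loss(0.0, 0.0))
# Ranking the preferred response higher gives a smaller loss than the reverse.
print(bradley_terry_loss(2.0, 0.0) < bradley_terry_loss(0.0, 2.0))
```

Only the difference between the two rewards matters, so the reward scale is fixed only up to an additive constant.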
RL Objective (PPO with KL): $$L_{\text{RL}} = -E[r_\theta(x, y)] + \beta \cdot KL[\pi_{\text{RL}} || \pi_{\text{SFT}}]$$
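The KL term keeps the RL policy from drifting too far from the SFT baseline, which limits over-optimization against an imperfect reward model. The coefficient β sets how strongly that drift is penalized.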
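To make the trade-off concrete, the sketch below computes a KL-penalized reward for one sampled response and an exact KL divergence for two small discrete distributions. The function names and numbers are assumptions for illustration; the per-sample log-ratio is a common single-sample estimate of the KL term, not necessarily the exact form used in any particular codebase.

```python
import math

def kl_divergence(p, q):
    """Exact KL[p || q] for two discrete distributions given as lists."""
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)

def penalized_reward(reward, logp_rl, logp_sft, beta=0.02):
    """Reward minus beta times a single-sample KL estimate.

    logp_rl, logp_sft: log-probability of the sampled response under the
    RL policy and the SFT policy (hypothetical inputs).
    """
    kl_estimate = logp_rl - logp_sft
    return reward - beta * kl_estimate

# Identical distributions have zero KL; differing ones have positive KL.
print(kl_divergence([0.5, 0.5], [0.5, 0.5]))
print(kl_divergence([0.9, 0.1], [0.5, 0.5]) > 0)
# A response the RL policy likes far more than the SFT policy is penalized.
print(penalized_reward(1.0, logp_rl=-10.0, logp_sft=-30.0))
```

Maximizing this penalized reward corresponds to minimizing the RL objective written above.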
Key Results
| Model | Size | Rating | Human Preference |
|---|---|---|---|
| GPT-3 | 175B | 1.54 | Baseline |
| InstructGPT | 1.3B | 4.53 | 72% preferred |
Headline: Human labelers preferred outputs from the aligned 1.3B model over those from the unaligned 175B model.
The Indian Analogy
Three stages of learning:
- Learning from a master teacher: Watch how the master solves problems (SFT)
- Learning from a coach: A coach rates your attempts and judges which are better (RM)
- Self-improvement with feedback: Practice and adjust based on the coach’s ratings, but don’t forget the master’s principles (RL with KL)
The balance between following the coach and remembering the master is crucial.
Key Numbers
- About 13,000 prompts with human-written demonstrations for SFT
- About 33,000 prompts for RM training, each with several model responses ranked by labelers
- 1.3B parameters for InstructGPT (tiny compared to GPT-3)
- 72% of humans preferred InstructGPT over GPT-3
- 4.53 rating vs. 1.54 (nearly 3× improvement)
- β ≈ 0.02 for KL coefficient (sweet spot in practice)
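The two ratios implied by the list above can be checked directly. This is plain arithmetic on the quoted figures; the variable names are illustrative.

```python
gpt3_params = 175e9          # GPT-3 parameter count
instructgpt_params = 1.3e9   # smallest InstructGPT parameter count

size_ratio = gpt3_params / instructgpt_params
rating_ratio = 4.53 / 1.54   # ratings quoted in the list above

print(round(size_ratio, 1))    # roughly 135x fewer parameters
print(round(rating_ratio, 2))  # a little under 3x
```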
What Came Before: Context
- Paper 12 (GPT-3): The base model, powerful but misaligned
- Paper 14 (Chain-of-Thought): Reasoning emerges at scale; influences reward model design
- Previous alignment work: Learning from preferences (Christiano et al., 2017)
What Came Next
- ChatGPT (Nov 2022): A sibling model to InstructGPT, trained with the same RLHF recipe and released to the public
- Constitutional AI (2022): Uses LLM feedback instead of human feedback
- DPO (2023): Removes the separate reward model
- ORPO (2024): Even simpler optimization
- Reasoning models (2024–2025): o1, R1 use test-time compute for reasoning
The basic insight (learn from preferences, optimize with RL) is here to stay. Implementation details improve yearly.
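As one example of how the implementation has been simplified, here is a hedged sketch of the DPO loss for a single preference pair. It replaces the separate reward model with log-probability ratios between the policy and a frozen reference model. The function names, the input format, and the value `beta=0.1` are assumptions for illustration.

```python
import math

def softplus(z):
    """log(1 + exp(z)), computed without overflow."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """-log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    chosen_logratio = logp_chosen - ref_logp_chosen
    rejected_logratio = logp_rejected - ref_logp_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    return softplus(-margin)  # equals -log sigmoid(margin)

# Policy equal to the reference: zero margin, loss is log(2).
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))
# Raising the chosen response's log-probability relative to the reference lowers the loss.
print(dpo_loss(-10.0, -15.0, -12.0, -15.0) < dpo_loss(-12.0, -15.0, -12.0, -15.0))
```

The loss has the same Bradley-Terry shape as the reward model loss, with the scaled log-ratio playing the role of the reward.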
Limitations
- ✗ Reward hacking (gaming the reward model)
- ✗ Human rater inconsistency (only 73% agreement)
- ✗ Distributional shift (RM unreliable out-of-distribution)
- ✗ Unfaithful reasoning (explanations don’t match computation)
- ✗ Data requirements (expensive to scale)
- ✗ KL tuning (β is hard to choose)
- ✗ Knowledge loss (can forget pretraining)
Why This Paper Matters
Before: “Alignment requires new architectures, symbolic AI, or different training objectives.”
After: “Alignment is learnable. Use preference learning + RL + KL penalty.”
This paper:
- Made alignment practical at scale
- Enabled ChatGPT (the product that changed everything)
- Showed that smaller + aligned > larger + unaligned
- Created a template followed by every modern LLM
In 2025: RLHF and its variants are part of the training recipe for nearly every deployed LLM assistant. This paper is foundational.
Glossary
RLHF (Reinforcement Learning from Human Feedback): Training language models using RL with human preferences as the reward signal.
Bradley-Terry Model: A pairwise comparison model in which the probability that one output is preferred over another is the sigmoid of the difference between their scalar scores.
Reward Model (RM): A neural network trained to predict human preferences on (prompt, response) pairs.
KL Divergence Penalty: A regularization term that keeps the RL policy close to the SFT policy, preventing excessive divergence.
PPO (Proximal Policy Optimization): A policy gradient algorithm that clips the probability ratio between the new and old policy in its objective, so that a single update cannot move the policy too far.
Alignment: Making models do what humans want, not just what’s probable in internet text.
Supervised Fine-Tuning (SFT): Fine-tuning a pretrained model on human-labeled examples using standard cross-entropy loss.
Instruction-Following: The capability to follow user instructions accurately and reliably.
Human Preference: What humans actually want, captured via comparisons or ratings.
Policy: The language model’s probability distribution over responses given a prompt.
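The clipping described in the PPO entry can be sketched for a single sample as follows. The function name is illustrative and `epsilon=0.2` is a commonly used clip range, assumed here rather than taken from this summary.

```python
def ppo_clipped_objective(ratio, advantage, epsilon=0.2):
    """Clipped surrogate objective for one sample (to be maximized).

    ratio: pi_new(a|s) / pi_old(a|s) for the sampled action.
    advantage: estimated advantage of that action.
    """
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    # Taking the minimum removes any incentive to push the ratio outside the clip range
    return min(ratio * advantage, clipped_ratio * advantage)

# Positive advantage: gains stop growing once the ratio exceeds 1 + epsilon.
print(ppo_clipped_objective(1.5, 1.0))
# Negative advantage: the objective stops improving once the ratio drops below 1 - epsilon.
print(ppo_clipped_objective(0.5, -1.0))
```

Note that the clip acts on the probability ratio inside the objective, which is distinct from gradient-norm clipping.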
Further Reading
Original Paper: Training Language Models to Follow Instructions with Human Feedback — Ouyang et al., NeurIPS 2022
Key Follow-Ups:
- Constitutional AI (Bai et al., 2022): LLM feedback instead of human
- DPO (Rafailov et al., 2023): Direct preference optimization
- ORPO (Hong et al., 2024): Simpler, more stable
Related Papers:
- Deep Reinforcement Learning from Human Preferences, Christiano et al., NeurIPS 2017 (groundwork)
- Fine-Tuning Language Models from Human Preferences, Ziegler et al., arXiv 2019 (earlier application)
Blog Resources:
- OpenAI’s blog on InstructGPT and ChatGPT
- Anthropic’s blog on Constitutional AI
- HuggingFace’s TRL library (RLHF implementation)
Code:
- OpenAI TL;DR summarization with RL
- HuggingFace TRL — Production-ready RLHF
- Anthropic’s Constitutional AI code
What to Read Next
Continue the alignment journey:
- Constitutional AI: Harmlessness from AI Feedback — RLAIF, the next iteration
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model — Simpler than RLHF
- Paper 16 (if available): Self-verification and scaling supervision
- DeepSeek R1 / OpenAI o1 papers — Reasoning-focused alignment
Key Takeaway
Alignment is learnable. With preference data and the right objective, we can train models to be helpful, harmless, and honest. This insight — demonstrated by this paper — changed AI forever.
Every time you use ChatGPT, Claude, or a similar assistant, you are using a model trained with this technique or one of its descendants.