Summary: RLHF and the Birth of ChatGPT

One-Sentence Version

Reinforcement Learning from Human Feedback (RLHF) aligns language models with human preferences through a three-stage pipeline (supervised fine-tuning, reward model training, and policy-gradient optimization); with it, human raters preferred outputs from a 1.3B-parameter InstructGPT over those of the 175B-parameter GPT-3, a model roughly 135× larger.

The Problem

Large language models like GPT-3 are capable but misaligned. They are trained to model the distribution of internet text, which mixes helpful, harmful, honest, and dishonest content without distinguishing between them. Next-token prediction gives them no signal about what a user actually wants.

The Idea

Three-stage pipeline:

SFT (Supervised Fine-Tuning): Fine-tune GPT-3 on human-written demonstrations
Reward Model (RM): Train a model that assigns a scalar score to each (prompt, response) pair, fit to human preference comparisons with a Bradley-Terry loss
RL with PPO: Optimize the policy to maximize reward while staying close to the SFT model (KL penalty)

The Math

SFT Loss: $$L_{\text{SFT}} = -E[\log \pi_{\text{SFT}}(y|x)]$$
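As a rough illustration of the SFT loss, the sketch below computes the average negative log-likelihood of demonstrated responses. It assumes we already have the probability the model assigned to each token of each demonstration; the function name `sft_loss` and the toy inputs are hypothetical, not taken from the paper.

```python
import math

def sft_loss(token_probs_per_example):
    """Average negative log-likelihood of demonstrated responses.

    token_probs_per_example: list of lists. Each inner list holds the
    probability the model gave to each token of one demonstration y,
    given its prompt x (toy, hypothetical input).
    """
    total_log_prob = 0.0
    for token_probs in token_probs_per_example:
        # log pi(y|x) is the sum of per-token log-probabilities
        total_log_prob += sum(math.log(p) for p in token_probs)
    return -total_log_prob / len(token_probs_per_example)

# A model that puts high probability on the demonstration has lower loss.
confident = sft_loss([[0.9, 0.8, 0.95]])
unsure = sft_loss([[0.3, 0.2, 0.25]])
```

Minimizing this quantity pushes the model to reproduce the human-written demonstrations.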

Reward Model Loss (Bradley-Terry): $$L_{\text{RM}} = -E[\log \sigma(r_\theta(x, y_w) - r_\theta(x, y_l))]$$
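The Bradley-Terry loss can be sketched in a few lines. This is a minimal example that assumes the reward model has already produced scalar scores for the preferred response (y_w) and the rejected response (y_l); `rm_loss` and `log_sigmoid` are illustrative names.

```python
import math

def log_sigmoid(z):
    # Numerically stable log(sigmoid(z))
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def rm_loss(score_pairs):
    """score_pairs: list of (r_w, r_l), the reward model's scalar scores
    for the preferred and rejected response to the same prompt."""
    total = sum(log_sigmoid(r_w - r_l) for r_w, r_l in score_pairs)
    return -total / len(score_pairs)

# Ranking the preferred response higher gives a lower loss.
correct_order = rm_loss([(2.0, 0.0)])
wrong_order = rm_loss([(0.0, 2.0)])
```

Only the difference between the two scores matters, so the loss is unchanged if a constant is added to every reward.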

RL Objective (PPO with KL): $$L_{\text{RL}} = -E_{y \sim \pi_{\text{RL}}}[r_\theta(x, y)] + \beta \cdot KL[\pi_{\text{RL}} \,\|\, \pi_{\text{SFT}}]$$

The KL term, weighted by β, penalizes the RL policy for drifting away from the SFT model, which limits how far it can exploit errors in the learned reward model.
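To make the trade-off concrete, here is a small sketch of the KL-penalized objective estimated from sampled responses. It assumes each sample carries its reward model score plus its log-probability under the RL policy and under the SFT model; the difference of the two log-probabilities is used as a single-sample estimate of the KL term. The function name `rl_loss` is hypothetical, and the default β = 0.02 is the value listed under Key Numbers.

```python
def rl_loss(samples, beta=0.02):
    """samples: list of (reward, logp_rl, logp_sft) tuples for responses
    drawn from the RL policy (toy, hypothetical inputs)."""
    total = 0.0
    for reward, logp_rl, logp_sft in samples:
        # Single-sample estimate of KL[pi_RL || pi_SFT]
        kl_estimate = logp_rl - logp_sft
        total += -reward + beta * kl_estimate
    return total / len(samples)

# Same reward, but the second policy has drifted from the SFT model.
close_to_sft = rl_loss([(1.0, -2.0, -2.0)])
drifted = rl_loss([(1.0, -1.0, -3.0)])
```

With equal reward, the drifted policy pays a higher loss, which is the pressure that keeps it near the SFT model.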

Key Results

Model	Size	Rating	Human Preference
GPT-3	175B	1.54	Baseline
InstructGPT	1.3B	4.53	72% preferred

Headline: A 1.3B model beats a 175B model when properly aligned.

The Indian Analogy

Three stages of learning:

Learning from a master teacher: Watch how the master solves problems (SFT)
Learning from a coach: A coach rates your attempts and judges which are better (RM)
Self-improvement with feedback: Practice and adjust based on the coach’s ratings, but don’t forget the master’s principles (RL with KL)

The balance between following the coach and remembering the master is crucial.

Key Numbers

About 13,000 prompts with human-written demonstrations for SFT
About 33,000 prompts with human-ranked responses for RM training
1.3B parameters for the smallest InstructGPT (versus 175B for GPT-3)
72% of humans preferred InstructGPT over GPT-3
4.53 rating vs. 1.54 (nearly 3× improvement)
β ≈ 0.02 for KL coefficient (sweet spot in practice)
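The ratios behind these numbers are easy to check. The snippet below only does arithmetic on the figures listed above; the variable names are illustrative.

```python
gpt3_params = 175e9
instructgpt_params = 1.3e9

# How many times larger GPT-3 is than the 1.3B InstructGPT
size_ratio = gpt3_params / instructgpt_params   # about 134.6

# Improvement in the rating reported above
rating_ratio = 4.53 / 1.54                      # about 2.94
```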

What Came Before: Context

Paper 12 (GPT-3): The base model, powerful but misaligned
Paper 14 (Chain-of-Thought): Reasoning emerges at scale; influences reward model design
Previous alignment work: Learning from preferences (Christiano et al., 2017)

What Came Next

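ChatGPT (Nov 2022): Applied the InstructGPT recipe to a GPT-3.5-series model and released it as a chat product
Constitutional AI (2022): Replaces human harmlessness labels with LLM feedback guided by written principles
DPO (2023): Optimizes the policy directly on preference pairs, removing the separate reward model and the RL loop
ORPO (2024): Folds preference optimization into supervised fine-tuning, removing the reference model as well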
Reasoning models (2024–2025): o1, R1 use test-time compute for reasoning

The basic insight (learn from preferences, optimize with RL) is here to stay. Implementation details improve yearly.
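To show how a successor like DPO drops the separate reward model, here is a minimal sketch of its loss for a single preference pair. It assumes we have the summed log-probability of the preferred (w) and rejected (l) responses under both the policy being trained and a frozen reference model; the name `dpo_loss` and the default `beta=0.1` are illustrative choices, not values from this summary.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigmoid(beta * (policy log-ratio margin over the reference))."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Numerically stable -log(sigmoid(margin))
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Policy identical to the reference: no preference expressed yet.
neutral = dpo_loss(-1.0, -1.0, -1.0, -1.0)
# Policy raised the preferred response and lowered the rejected one.
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

The log-ratio against the reference plays the role of an implicit reward, and the loss has the same Bradley-Terry form as the reward model loss.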

Limitations

✗ Reward hacking (gaming the reward model)
✗ Human rater inconsistency (only 73% agreement)
✗ Distributional shift (RM unreliable out-of-distribution)
✗ Unfaithful reasoning (explanations don’t match computation)
✗ Data requirements (expensive to scale)
✗ KL tuning (β is hard to choose)
✗ Knowledge loss (can forget pretraining)

Why This Paper Matters

Before: “Alignment requires new architectures, symbolic AI, or different training objectives.”

After: “Alignment is learnable. Use preference learning + RL + KL penalty.”

This paper:

Made alignment practical at scale
Enabled ChatGPT (the product that changed everything)
Showed that smaller + aligned > larger + unaligned
Created a template that most modern chat LLMs still follow

In 2025: RLHF and its variants are part of the post-training of nearly every deployed LLM. This paper is foundational.

Glossary

RLHF (Reinforcement Learning from Human Feedback): Training language models using RL with human preferences as the reward signal.

Bradley-Terry Model: A pairwise comparison model in which the probability that one output is preferred over another is the sigmoid of the difference between their scalar scores.

Reward Model (RM): A neural network trained to predict human preferences on (prompt, response) pairs.

KL Divergence Penalty: A regularization term that keeps the RL policy close to the SFT policy, preventing excessive divergence.

PPO (Proximal Policy Optimization): A policy-gradient RL algorithm that clips the probability ratio between the new and old policy in its objective, which keeps each policy update small and training stable.

Alignment: Making models do what humans want, not just what’s probable in internet text.

Supervised Fine-Tuning (SFT): Fine-tuning a pretrained model on human-labeled examples using standard cross-entropy loss.

Instruction-Following: The capability to follow user instructions accurately and reliably.

Human Preference: What humans actually want, captured via comparisons or ratings.

Policy: The language model’s probability distribution over responses given a prompt.
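The clipping in the PPO entry above can be illustrated with the standard clipped surrogate objective for a single action. This is a sketch, not the paper's training code: `ratio` stands for the new policy's probability divided by the old policy's, `advantage` is an estimate of how much better the action was than expected, and `epsilon=0.2` is a commonly used default assumed here.

```python
def ppo_clip_objective(ratio, advantage, epsilon=0.2):
    """Clipped surrogate objective for one action (to be maximized)."""
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    # Take the more pessimistic of the clipped and unclipped terms
    return min(ratio * advantage, clipped_ratio * advantage)

# Good action, ratio already above 1 + epsilon: the gain is capped.
capped_gain = ppo_clip_objective(1.5, 1.0)
# Bad action made more likely: the penalty is not capped.
full_penalty = ppo_clip_objective(1.5, -1.0)
```

Because the gain is capped once the ratio leaves the interval [1 - epsilon, 1 + epsilon], a single update has no incentive to move the policy far from the one that collected the data.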

Key Takeaway

Alignment is learnable. With preference data and the right objective, we can train models to be helpful, harmless, and honest. This insight — demonstrated by this paper — changed AI forever.

Every time you use ChatGPT, Claude, or any aligned LLM, you’re using the technique from this paper.

