The Math: SFT, Reward Model, and PPO with KL
This section covers the mathematical formulation of all three stages. Refer back to it as you implement each stage.
Prerequisites: Entropy, KL Divergence, Cross-Entropy Loss
Stage 1: Supervised Fine-Tuning (SFT)
Objective
Minimize the cross-entropy loss on human-written demonstrations:
$$L_{\text{SFT}} = -\sum_{i=1}^{N} \sum_{t=1}^{T_i} \log \pi_{\text{SFT}}(y_{i,t} | x_i, y_{i,1:t-1})$$
Where:
- $N$ = number of examples
- $x_i$ = prompt $i$
- $y_i = [y_{i,1}, y_{i,2}, \ldots, y_{i,T_i}]$ = human-written response (sequence of tokens)
- $\pi_{\text{SFT}}$ = SFT model’s probability distribution over next token
Simplified form:
$$L_{\text{SFT}} = -E_{(x,y) \sim D_{\text{demo}}}[\log \pi_{\text{SFT}}(y|x)]$$
Where $D_{\text{demo}}$ is the distribution of human demonstrations.
Interpretation
This is standard language model training. For each token in the human response, we want the model to assign high probability. The loss is the negative log-likelihood (cross-entropy).
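As a minimal sketch of the per-token sum above, assume we already have the probability the model assigned to each human-written token of one demonstration. The function name `sft_loss` and the probabilities are made up for illustration; they are not model outputs.

```python
import math

def sft_loss(token_probs):
    """Negative log-likelihood of one response.

    token_probs[t] is the probability the model assigned to the
    human-written token at position t, given the prompt and the
    previous tokens (hypothetical values in this sketch).
    """
    return -sum(math.log(p) for p in token_probs)

# Illustrative probabilities for a 4-token response.
probs = [0.9, 0.5, 0.8, 0.25]
loss = sft_loss(probs)
print(f"sequence NLL  = {loss:.4f} nats")
print(f"per-token NLL = {loss / len(probs):.4f} nats")
```

Summing log-probabilities instead of multiplying probabilities avoids underflow on long responses. Dividing by the token count gives a per-token loss that is comparable across responses of different lengths.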
Data Requirements
The paper uses:
- 13k human-written demonstrations
- About 40 contractors
- Prompts from various sources (user queries, generative tasks, writing tasks)
Typical cost: ~$0.50 per demonstration (including contractor overhead).
Stage 2: Reward Model (RM) Training
Data Format
Unlike SFT, we don’t need full responses. We collect comparisons:
$$(x, y_w, y_l)$$
where:
- $x$ = prompt
- $y_w$ = response preferred by the human rater (winner)
- $y_l$ = response dispreferred by the human rater (loser)
Advantage: Cheaper than writing full responses. A human can compare two responses in ~15 seconds vs. 2 minutes to write one.
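One way to hold a comparison in code is a small record type. The class, field, and helper names below are hypothetical, chosen only to mirror $(x, y_w, y_l)$.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Comparison:
    prompt: str    # x
    chosen: str    # y_w, the response the rater preferred
    rejected: str  # y_l, the response the rater did not prefer

def comparison_from_label(prompt, response_a, response_b, a_is_better):
    """Turn a rater's A-vs-B judgment into a (x, y_w, y_l) record."""
    if a_is_better:
        return Comparison(prompt, chosen=response_a, rejected=response_b)
    return Comparison(prompt, chosen=response_b, rejected=response_a)

example = comparison_from_label(
    "What is 2+2?", "2 + 2 = 4", "2 + 2 = 5", a_is_better=True
)
print(example)
```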
Bradley-Terry Model
The RM is a neural network $r_\theta(x, y)$ that outputs a scalar score (logit) for a (prompt, response) pair.
Probability that y_w is preferred:
$$P(y_w \text{ preferred} | y_w, y_l, x) = \sigma(r_\theta(x, y_w) - r_\theta(x, y_l))$$
Where $\sigma(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid function.
Loss for one comparison:
$$L = -\log \sigma(r_\theta(x, y_w) - r_\theta(x, y_l))$$
This is the binary cross-entropy loss applied to the preference classification task.
Batch Loss
For a batch of comparisons:
$$L_{\text{RM}} = -E_{(x, y_w, y_l) \sim D_{\text{comp}}}[\log \sigma(r_\theta(x, y_w) - r_\theta(x, y_l))]$$
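The batch loss can be sketched in plain Python as a mean over per-comparison losses. The function names are illustrative. The only non-obvious choice is computing $\log \sigma(z)$ in a form that does not overflow for large $|z|$.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    # For z >= 0: log sigma(z) = -log(1 + exp(-z)).
    # For z <  0: log sigma(z) = z - log(1 + exp(z)).
    # Either branch only exponentiates a non-positive number.
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def rm_batch_loss(chosen_scores, rejected_scores):
    """Mean Bradley-Terry loss over a batch of comparisons."""
    if len(chosen_scores) != len(rejected_scores):
        raise ValueError("score lists must have the same length")
    losses = [
        -log_sigmoid(r_w - r_l)
        for r_w, r_l in zip(chosen_scores, rejected_scores)
    ]
    return sum(losses) / len(losses)

# Illustrative scores for three comparisons.
print(rm_batch_loss([2.5, 0.1, 1.0], [-1.2, 0.5, 1.0]))
```

When the two scores are equal the model is indifferent, and the per-comparison loss is $\log 2 \approx 0.693$, the same as a coin flip.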
Architecture
The RM is typically the SFT model with a linear head on top:
- Take the SFT model up to the last hidden layer
- Add a linear layer: hidden_state → scalar
- Train this on comparison data
Why reuse SFT? It already understands language. We only need to learn which responses are better.
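A minimal sketch of the linear head, written without a deep learning framework. It assumes the score is read from the hidden state of the final token, which is a common choice but an assumption here; the vector values and the name `reward_head` are hypothetical.

```python
def reward_head(hidden_state, weights, bias=0.0):
    """Linear head: maps a hidden-state vector to a scalar score."""
    if len(hidden_state) != len(weights):
        raise ValueError("hidden_state and weights must have the same length")
    return sum(h * w for h, w in zip(hidden_state, weights)) + bias

# Hypothetical 4-dimensional hidden state of the final token.
h_last = [0.5, -1.0, 2.0, 0.0]
w = [1.0, 0.5, 0.25, -3.0]
score = reward_head(h_last, w, bias=0.1)
print(f"r_theta(x, y) = {score:.2f}")
```

In a real model the hidden state has thousands of dimensions, and the head's weights are trained jointly with (or on top of) the transformer using the comparison loss.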
Data Requirements
The paper uses:
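- 33k training prompts; each prompt has several responses ranked by a rater, and every pair in a ranking yields one comparison, so the number of comparisons exceeds the number of prompts
- The same contractor pool as for SFT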
- Typical inter-rater agreement: ~70–75%
Rater disagreement matters: it shows that preferences are genuinely ambiguous, so the RM effectively learns a distribution over human judgments rather than a single ground truth.
Stage 3: Reinforcement Learning (RL) with PPO and KL Penalty
Full RL Objective
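$$L_{\text{RL}} = -E_{x \sim D_{\text{prompt}},\, y \sim \pi_{\text{RL}}(\cdot|x)}[r_\theta(x, y)] + \beta \cdot KL[\pi_{\text{RL}}(y|x) || \pi_{\text{SFT}}(y|x)]$$
Here $D_{\text{prompt}}$ is the set of prompts used for RL training. No human-written responses are needed at this stage, because the policy generates $y$ itself.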
Breaking this down:
Term 1: Reward Maximization $$-E[r_\theta(x, y)]$$
Use policy gradient to increase the probability of high-reward outputs. The negative sign is because we’re minimizing loss (so maximizing reward).
Term 2: KL Divergence Penalty $$\beta \cdot KL[\pi_{\text{RL}}(y|x) || \pi_{\text{SFT}}(y|x)]$$
Constrain the RL policy to stay close to the SFT policy. This prevents:
- Reward hacking: Exploiting flaws in the RM
- Forgetting: Losing knowledge from pretraining
- Distribution shift: Straying too far from the domain the RM was trained on
KL Divergence Expansion
Recall from the math tutorial:
$$KL[P || Q] = \sum_y P(y) \log \frac{P(y)}{Q(y)} = E_{y \sim P}[\log P(y) - \log Q(y)]$$
So the KL term becomes:
$$KL[\pi_{\text{RL}} || \pi_{\text{SFT}}] = E_{y \sim \pi_{\text{RL}}}[\log \pi_{\text{RL}}(y|x) - \log \pi_{\text{SFT}}(y|x)]$$
The RL loss is:
$$L_{\text{RL}} = -E[r_\theta(x, y)] + \beta \left( E[\log \pi_{\text{RL}}(y|x)] - E[\log \pi_{\text{SFT}}(y|x)] \right)$$
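The middle term, $\beta E[\log \pi_{\text{RL}}(y|x)]$, is $-\beta$ times the entropy of the RL policy. Minimizing it raises entropy, so it acts as an exploration bonus.
The last term, $-\beta E[\log \pi_{\text{SFT}}(y|x)]$, is $\beta$ times the cross-entropy of $\pi_{\text{SFT}}$ under samples from $\pi_{\text{RL}}$. The SFT weights are frozen, so $\log \pi_{\text{SFT}}(y|x)$ is computed without backpropagating through the SFT model. The expectation is still taken over $y \sim \pi_{\text{RL}}$, so this term depends on the RL policy and does contribute to the gradient: it penalizes responses that the SFT model considers unlikely.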
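The expectation form of the KL is what makes it usable in practice: we cannot sum over every possible response, but we can sample from $\pi_{\text{RL}}$ and average $\log \pi_{\text{RL}} - \log \pi_{\text{SFT}}$. The sketch below checks this on a toy three-outcome distribution (the probabilities are illustrative) and also confirms that the KL splits into cross-entropy minus entropy, i.e. $E[\log \pi_{\text{RL}}]$ equals the negative entropy.

```python
import math
import random

def kl_exact(p, q):
    """KL[P || Q] as an explicit sum over outcomes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def kl_monte_carlo(p, q, n_samples, seed=0):
    """Estimate KL[P || Q] by sampling y ~ P and averaging log P(y) - log Q(y)."""
    rng = random.Random(seed)
    samples = rng.choices(range(len(p)), weights=p, k=n_samples)
    return sum(math.log(p[i]) - math.log(q[i]) for i in samples) / n_samples

p = [0.5, 0.3, 0.2]  # stands in for pi_RL
q = [0.4, 0.4, 0.2]  # stands in for pi_SFT

exact = kl_exact(p, q)
estimate = kl_monte_carlo(p, q, n_samples=100_000)

entropy = -sum(pi * math.log(pi) for pi in p)                       # H(P)
cross_entropy = -sum(pi * math.log(qi) for pi, qi in zip(p, q))     # H(P, Q)

print(f"exact KL          = {exact:.5f}")
print(f"sampled estimate  = {estimate:.5f}")
print(f"H(P,Q) - H(P)     = {cross_entropy - entropy:.5f}")
```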
Policy Gradient Update (PPO)
PPO (Proximal Policy Optimization) updates the policy to minimize the loss.
For a single rollout from prompt $x$:
$$y \sim \pi_{\text{RL}}(\cdot | x) \text{ (generate response)}$$
$$r = r_\theta(x, y) \text{ (get reward)}$$
$$\text{advantage} = r - V(x) \text{ (subtract baseline to reduce variance)}$$
Where $V(x)$ is a learned baseline (value function).
The policy gradient is:
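$$\nabla_\theta L_{\text{RL}} \approx -\left[ r - \beta \left( \log \pi_{\text{RL}}(y|x) - \log \pi_{\text{SFT}}(y|x) \right) - V(x) \right] \nabla_\theta \log \pi_{\text{RL}}(y|x)$$
The bracket is the advantage computed from the KL-shaped reward: the RM score, minus $\beta$ times the sampled log-ratio, minus the baseline $V(x)$. It is treated as a constant weight on the score function $\nabla_\theta \log \pi_{\text{RL}}(y|x)$.
(This is simplified; PPO optimizes a surrogate objective with a clipped probability ratio for stability.)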
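The rollout steps above, plus PPO's clipping, can be sketched for a single sampled response as follows. This is a simplified illustration, not a full PPO implementation: it treats the whole response as one action, folds the KL penalty into the reward as $r - \beta(\log \pi_{\text{RL}} - \log \pi_{\text{SFT}})$, and uses a clip range of 0.2, which is a commonly used default and an assumption here. All numbers and function names are hypothetical.

```python
import math

def shaped_reward(rm_score, logp_rl, logp_sft, beta=0.02):
    """RM score minus beta times the sampled log-ratio (a one-sample KL estimate)."""
    return rm_score - beta * (logp_rl - logp_sft)

def ppo_clipped_loss(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO clipped surrogate loss for one sampled response.

    logp_old is the log-probability under the policy that generated the
    sample; logp_new is the log-probability under the policy being updated.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Take the more pessimistic of the two objectives, then negate to get a loss.
    return -min(ratio * advantage, clipped_ratio * advantage)

# One hypothetical rollout.
rm_score = 3.0        # r_theta(x, y)
logp_rl = -12.0       # log pi_RL(y|x) at sampling time
logp_sft = -12.5      # log pi_SFT(y|x)
value_estimate = 2.0  # V(x)

reward = shaped_reward(rm_score, logp_rl, logp_sft)
advantage = reward - value_estimate
loss = ppo_clipped_loss(logp_new=-11.8, logp_old=logp_rl, advantage=advantage)
print(f"shaped reward = {reward:.4f}, advantage = {advantage:.4f}, loss = {loss:.4f}")
```

In this example the updated policy has already raised the response's probability by more than 20%, so the ratio is clipped and further increases earn no additional objective. That is how clipping limits the size of each policy update.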
Hyperparameter: β (KL Coefficient)
The coefficient $\beta$ controls the trade-off:
- β = 0: Pure reward-seeking (RL ignores SFT baseline). Model might exploit RM flaws.
- β large (e.g., 1.0): Strong KL penalty. Model stays very close to SFT, limited improvement.
- β ≈ 0.01–0.1: Sweet spot. Improvement from reward + regularization from KL.
In the paper: β ≈ 0.02 works well empirically.
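The trade-off can be made concrete on a toy problem. For a fixed prompt with a small discrete set of responses, the policy that maximizes $E[r] - \beta \cdot KL[\pi || \pi_{\text{SFT}}]$ has the closed form $\pi^*(y) \propto \pi_{\text{SFT}}(y) \exp(r(y)/\beta)$ for $\beta > 0$. The sketch below evaluates that idealized optimum for several values of $\beta$. The three responses, their SFT probabilities, and their rewards are made up, and PPO on a real model only approximates this optimum.

```python
import math

def kl_regularized_optimum(sft_probs, rewards, beta):
    """Closed-form maximizer of E[r] - beta * KL[pi || pi_SFT] (requires beta > 0)."""
    logits = [math.log(p) + r / beta for p, r in zip(sft_probs, rewards)]
    # Subtract the max before exponentiating to avoid overflow at small beta.
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]

sft = [0.6, 0.3, 0.1]      # pi_SFT over three candidate responses
rewards = [0.0, 0.5, 1.0]  # RM scores for the same responses

for beta in (10.0, 1.0, 0.1, 0.02):
    policy = kl_regularized_optimum(sft, rewards, beta)
    print(f"beta={beta:<5} policy={[round(p, 3) for p in policy]}")
```

With large $\beta$ the optimum stays near the SFT distribution; as $\beta$ shrinks, nearly all probability moves to the highest-reward response, which is exactly the regime where RM flaws get exploited.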
Practical Concern: Reward Shaping
In practice, RL might optimize a simple reward while ignoring other important aspects. For example:
- The model might write longer responses (if longer responses tend to be rated better)
- Or use confusing jargon (if that’s in the training data)
Solution: Include multiple reward signals or penalize length:
$$L_{\text{RL}} = -r_\theta(x, y) + \beta \cdot KL[\pi_{\text{RL}} || \pi_{\text{SFT}}] + \alpha \cdot \log |y|$$
Where $|y|$ is the response length in tokens (written $|y|$ to avoid a clash with the loss symbol $L$). The $\alpha$ term penalizes excessively long responses.
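A small sketch of the length-penalized loss for one sampled response. The default $\alpha = 0.1$ is an arbitrary illustrative value, not a recommendation, and the function name is hypothetical.

```python
import math

def length_penalized_loss(reward, kl, num_tokens, beta=0.02, alpha=0.1):
    """Per-sample loss: -reward + beta * KL + alpha * log(response length)."""
    if num_tokens < 1:
        raise ValueError("num_tokens must be at least 1")
    return -reward + beta * kl + alpha * math.log(num_tokens)

# Same reward and KL, different lengths: the longer response costs more.
short = length_penalized_loss(reward=3.0, kl=0.04, num_tokens=50)
long_ = length_penalized_loss(reward=3.0, kl=0.04, num_tokens=200)
print(f"50 tokens:  {short:.4f}")
print(f"200 tokens: {long_:.4f}")
```

Because the penalty is logarithmic, going from 50 to 200 tokens costs the same as going from 200 to 800, so it discourages padding without heavily punishing answers that genuinely need more space.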
Worked Example: Reward Model Loss
Scenario: Training the reward model on a single comparison.
Prompt: "What is 2+2?"
Output A (y_w): "2 + 2 = 4"
Output B (y_l): "2 + 2 = 5"
Human rater: A is better
Suppose the reward model outputs:
- $r_\theta(x, y_w) = 2.5$ (logit for correct answer)
- $r_\theta(x, y_l) = -1.2$ (logit for wrong answer)
Step 1: Compute the difference $$z = r_\theta(x, y_w) - r_\theta(x, y_l) = 2.5 - (-1.2) = 3.7$$
Step 2: Apply sigmoid $$\sigma(z) = \sigma(3.7) = \frac{1}{1 + e^{-3.7}} = \frac{1}{1 + 0.0247} = \frac{1}{1.0247} \approx 0.9759$$
Step 3: Compute loss $$L = -\log \sigma(z) = -\log(0.9759) \approx 0.0244$$
Interpretation: The loss is small (~0.024) because the model confidently (0.98 probability) predicted the correct preference.
Contrast: If the model predicted wrong
If $r_\theta(x, y_w) = 0.1$ and $r_\theta(x, y_l) = 0.5$:
$$z = 0.1 - 0.5 = -0.4$$
$$\sigma(-0.4) = \frac{1}{1 + e^{0.4}} \approx 0.4013$$
$$L = -\log(0.4013) \approx 0.913$$
Much higher loss because the model predicted incorrectly (40% confidence in the right preference).
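Both cases can be reproduced in a few lines. The helper name `bt_loss` is illustrative; it prints the preference probability and the loss for each pair of scores.

```python
import math

def bt_loss(r_w, r_l):
    """Return (P(y_w preferred), loss) for one comparison."""
    z = r_w - r_l
    p = 1.0 / (1.0 + math.exp(-z))
    return p, -math.log(p)

p_good, loss_good = bt_loss(2.5, -1.2)  # model ranks the pair correctly
p_bad, loss_bad = bt_loss(0.1, 0.5)     # model ranks the pair incorrectly

print(f"correct ranking: p = {p_good:.4f}, loss = {loss_good:.4f}")
print(f"wrong ranking:   p = {p_bad:.4f}, loss = {loss_bad:.4f}")
```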
Worked Example: KL Divergence Penalty
Scenario: After RL training, we compute the KL penalty for one prompt.
Prompt: "Explain photosynthesis."
π_SFT distribution over first token:
P(SFT)("Plants") = 0.6
P(SFT)("Photosynthesis") = 0.2
P(SFT)("A") = 0.1
P(SFT)("The") = 0.1
π_RL distribution over first token (after RL):
P(RL)("Plants") = 0.7
P(RL)("Photosynthesis") = 0.1
P(RL)("A") = 0.1
P(RL)("The") = 0.1
Compute KL divergence (using natural log):
$$KL = \sum P_{\text{RL}}(w) \log \frac{P_{\text{RL}}(w)}{P_{\text{SFT}}(w)}$$
For each token:
- “Plants”: $P_{\text{RL}} = 0.7, P_{\text{SFT}} = 0.6$
  - $0.7 \log(0.7/0.6) = 0.7 \times 0.1542 = 0.1079$
- “Photosynthesis”: $P_{\text{RL}} = 0.1, P_{\text{SFT}} = 0.2$
  - $0.1 \log(0.1/0.2) = 0.1 \times (-0.6931) = -0.0693$
- “A”: $P_{\text{RL}} = 0.1, P_{\text{SFT}} = 0.1$
  - $0.1 \log(1.0) = 0$
- “The”: $P_{\text{RL}} = 0.1, P_{\text{SFT}} = 0.1$
  - $0.1 \log(1.0) = 0$
Total: $$KL = 0.1079 - 0.0693 = 0.0386 \text{ nats}$$
In the loss: $$\text{KL penalty} = \beta \times KL = 0.02 \times 0.0386 = 0.000772$$
If the prompt gets reward $r = 3.0$:
$$L_{\text{RL}} = -3.0 + 0.000772 \approx -2.999$$
The KL penalty is small relative to reward, but it accumulates over the sequence and prevents divergence.
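The same arithmetic in code, using the first-token distributions from this example. The function name is illustrative, and the calculation covers only the first token, as in the text.

```python
import math

sft = {"Plants": 0.6, "Photosynthesis": 0.2, "A": 0.1, "The": 0.1}
rl = {"Plants": 0.7, "Photosynthesis": 0.1, "A": 0.1, "The": 0.1}

def kl_divergence(p, q):
    """KL[P || Q] in nats for two distributions over the same tokens."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p if p[w] > 0)

beta = 0.02
reward = 3.0

kl = kl_divergence(rl, sft)
penalty = beta * kl
loss = -reward + penalty

print(f"KL      = {kl:.4f} nats")
print(f"penalty = {penalty:.6f}")
print(f"loss    = {loss:.3f}")
```

Individual terms of the sum can be negative (here, "Photosynthesis"), but the total is never negative, and it is zero only when the two distributions are identical.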
Summary: Mathematical Components
| Stage | Loss | Key Components |
|---|---|---|
| SFT | $-\log \pi_{\text{SFT}}(y \mid x)$ | Cross-entropy on human responses |
| RM | $-\log \sigma(r_\theta(x,y_w) - r_\theta(x,y_l))$ | Bradley-Terry classification |
| RL | $-r_\theta + \beta \cdot KL[\pi_{\text{RL}} \parallel \pi_{\text{SFT}}]$ | Reward + KL constraint |
All three use standard ML techniques, but combined strategically for alignment.
Practical Notes
- Scaling: Each stage is relatively cheap to train:
  - SFT: ~1 GPU-hour
  - RM: ~0.5 GPU-hour
  - RL: ~10 GPU-hours (most expensive due to generation and reward computation)
- Instability: RL can be unstable (gradients explode). PPO mitigates this via clipping.
- Data efficiency: Compared to pure RL from human ratings, this three-stage approach is ~100× more sample-efficient because the RM learns to generalize.
- Metric: Human raters prefer InstructGPT (1.3B) over GPT-3 (175B) 72% of the time. This is the paper’s headline result.