The Math: SFT, Reward Model, and PPO with KL

This section covers the mathematical formulation of all three stages. Refer back to it as you implement each one.

Prerequisites: Entropy, KL Divergence, Cross-Entropy Loss

Stage 1: Supervised Fine-Tuning (SFT)

Objective

Minimize the cross-entropy loss on human-written demonstrations:

$$L_{\text{SFT}} = -\sum_{i=1}^{N} \sum_{t=1}^{T_i} \log \pi_{\text{SFT}}(y_{i,t} | x_i, y_{i,1:t-1})$$

Where:

$N$ = number of examples
$x_i$ = prompt $i$
$y_i = [y_{i,1}, y_{i,2}, \ldots, y_{i,T_i}]$ = human-written response (sequence of tokens)
$\pi_{\text{SFT}}$ = SFT model’s probability distribution over next token

Simplified form:

$$L_{\text{SFT}} = -E_{(x,y) \sim D_{\text{demo}}}[\log \pi_{\text{SFT}}(y|x)]$$

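Where $D_{\text{demo}}$ is the distribution of human demonstrations and $\log \pi_{\text{SFT}}(y|x) = \sum_t \log \pi_{\text{SFT}}(y_t | x, y_{1:t-1})$. This form averages over examples instead of summing, which only rescales the loss by a constant factor.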

Interpretation

This is standard language model training. For each token in the human response, we want the model to assign high probability. The loss is the negative log-likelihood (cross-entropy).
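
As a rough illustration of the objective, the sketch below computes the negative log-likelihood of a single response from the probabilities a model assigned to each reference token. The function name `sft_loss` and the probability values are hypothetical, chosen only to show that confident predictions give a lower loss.

```python
import math

def sft_loss(token_probs):
    """Negative log-likelihood of one response.

    token_probs[t] is the probability the model assigned to the
    reference token y_t, given the prompt and the previous tokens.
    """
    return -sum(math.log(p) for p in token_probs)

# Hypothetical per-token probabilities for a 3-token human response.
confident = sft_loss([0.9, 0.8, 0.95])
unsure = sft_loss([0.2, 0.1, 0.3])

print(f"confident model: {confident:.4f}")
print(f"unsure model:    {unsure:.4f}")
```

Summing this quantity over all examples gives the full $L_{\text{SFT}}$.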

Data Requirements

The paper uses:

About 13k training prompts with human-written demonstrations
About 40 contractors
Prompts from various sources (user queries, generative tasks, writing tasks)

Typical cost: ~$0.50 per demonstration (including contractor overhead).

Stage 2: Reward Model (RM) Training

Data Format

Unlike SFT, labelers don't need to write responses themselves. Instead, they compare responses generated by the model:

(x, y_w, y_l)

where:
  x = prompt
  y_w = response preferred by human (winner)
  y_l = response dispreferred by human (loser)

Advantage: Cheaper than writing full responses. A human can compare two responses in ~15 seconds vs. 2 minutes to write one.
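
One possible way to hold a comparison in code is a small immutable record. The class name `Comparison` and its field names are assumptions made for illustration; they are not taken from the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Comparison:
    """One human preference judgment (x, y_w, y_l)."""
    prompt: str    # x
    chosen: str    # y_w, the response the labeler preferred
    rejected: str  # y_l, the response the labeler did not prefer

    def as_tuple(self):
        """Return the record in the (x, y_w, y_l) order used in the text."""
        return (self.prompt, self.chosen, self.rejected)

example = Comparison(
    prompt="What is 2+2?",
    chosen="2 + 2 = 4",
    rejected="2 + 2 = 5",
)
print(example.as_tuple())
```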

Bradley-Terry Model

The RM is a neural network $r_\theta(x, y)$ that outputs a scalar score (logit) for a (prompt, response) pair.

Probability that y_w is preferred:

$$P(y_w \text{ preferred} | y_w, y_l, x) = \sigma(r_\theta(x, y_w) - r_\theta(x, y_l))$$

Where $\sigma(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid function.

Loss for one comparison:

$$L = -\log \sigma(r_\theta(x, y_w) - r_\theta(x, y_l))$$

This is the binary cross-entropy loss applied to the preference classification task.
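
A minimal sketch of this loss for a single comparison, taking the two scalar scores as plain floats. The function names are hypothetical. The loss is written as $\log(1 + e^{-z})$, which equals $-\log \sigma(z)$ and avoids overflow when $|z|$ is large.

```python
import math

def preference_prob(r_w, r_l):
    """sigma(r_w - r_l): modeled probability that y_w is preferred.

    Fine for moderate score gaps; very negative gaps can overflow exp().
    """
    return 1.0 / (1.0 + math.exp(-(r_w - r_l)))

def rm_loss(r_w, r_l):
    """-log sigma(r_w - r_l), computed as log(1 + exp(-z)) in a stable way."""
    z = r_w - r_l
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

# Equal scores mean the model is undecided, so the loss is log(2).
print(rm_loss(0.0, 0.0))
# A large margin in the right direction drives the loss toward 0.
print(rm_loss(5.0, -5.0))
```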

Batch Loss

For a batch of comparisons:

$$L_{\text{RM}} = -E_{(x, y_w, y_l) \sim D_{\text{comp}}}[\log \sigma(r_\theta(x, y_w) - r_\theta(x, y_l))]$$

Architecture

The RM is typically the SFT model with a linear head on top:

Take the SFT model up to the last hidden layer
Add a linear layer: hidden_state → scalar
Train this on comparison data

Why reuse SFT? It already understands language. We only need to learn which responses are better.
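
The linear head itself is just a dot product. The sketch below assumes the head reads a single hidden-state vector (commonly the one at the final token, though that choice is an assumption here) and uses hypothetical weights. In a real model the weights would be learned and the hidden state would come from the SFT network.

```python
def reward_head(hidden_state, weights, bias=0.0):
    """Map a hidden-state vector to a scalar reward: w . h + b."""
    if len(hidden_state) != len(weights):
        raise ValueError("hidden_state and weights must have the same length")
    return sum(h * w for h, w in zip(hidden_state, weights)) + bias

# Hypothetical 4-dimensional hidden state and head parameters.
hidden = [0.5, -1.0, 2.0, 0.0]
w = [1.0, 0.5, 0.25, -3.0]
score = reward_head(hidden, w, bias=0.1)
print(score)
```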

Data Requirements

The paper uses:

About 33k training prompts, each yielding several pairwise comparisons (labelers rank between 4 and 9 responses per prompt)
About 40 contractors (the same labeler pool as for SFT)
Typical inter-rater agreement: ~70–75%

The fact that raters sometimes disagree is important — it means there’s genuine ambiguity in preferences, and the RM learns a distribution over human judgments.

Stage 3: Reinforcement Learning (RL) with PPO and KL Penalty

Full RL Objective

$$L_{\text{RL}} = -E_{x \sim D_{\text{RL}}, y \sim \pi_{\text{RL}}(\cdot|x)}[r_\theta(x, y)] + \beta \cdot KL[\pi_{\text{RL}}(y|x) || \pi_{\text{SFT}}(y|x)]$$

Here $D_{\text{RL}}$ is the set of prompts used for RL training.

Breaking this down:

Term 1: Reward Maximization $$-E[r_\theta(x, y)]$$

Use policy gradient to increase the probability of high-reward outputs. The negative sign is because we’re minimizing loss (so maximizing reward).

Term 2: KL Divergence Penalty $$\beta \cdot KL[\pi_{\text{RL}}(y|x) || \pi_{\text{SFT}}(y|x)]$$

Constrain the RL policy to stay close to the SFT policy. This prevents:

Reward hacking: Exploiting flaws in the RM
Forgetting: Losing knowledge from pretraining
Distribution shift: Straying too far from the domain the RM was trained on

KL Divergence Expansion

Recall from the math tutorial:

$$KL[P || Q] = \sum_y P(y) \log \frac{P(y)}{Q(y)} = E_{y \sim P}[\log P(y) - \log Q(y)]$$

So the KL term becomes:

$$KL[\pi_{\text{RL}} || \pi_{\text{SFT}}] = E_{y \sim \pi_{\text{RL}}}[\log \pi_{\text{RL}}(y|x) - \log \pi_{\text{SFT}}(y|x)]$$

The RL loss is:

$$L_{\text{RL}} = -E[r_\theta(x, y)] + \beta \left( E[\log \pi_{\text{RL}}(y|x)] - E[\log \pi_{\text{SFT}}(y|x)] \right)$$

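The middle term ($E[\log \pi_{\text{RL}}]$) is the negative entropy of the RL policy. Minimizing it raises entropy, so this term encourages exploration.

The last term ($-\beta E[\log \pi_{\text{SFT}}]$) is $\beta$ times the cross-entropy between the RL policy and the SFT policy. The SFT weights are frozen, so $\log \pi_{\text{SFT}}(y|x)$ is a fixed number for any given $y$. The expectation, however, is taken over $y \sim \pi_{\text{RL}}$, so the term still depends on the RL policy and does affect the gradient: it penalizes sampling responses that the SFT model considers unlikely.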
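
To make the expansion concrete, the sketch below checks on a toy categorical distribution that the KL divergence equals cross-entropy minus entropy, and that averaging $\log P - \log Q$ over samples drawn from $P$ approximates it. The distributions, sample count, and seed are hypothetical choices for illustration.

```python
import math
import random

def exact_kl(p, q):
    """KL[P || Q] = sum_y P(y) log(P(y) / Q(y))."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    """H(P) = -E_P[log P]."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """CE(P, Q) = -E_P[log Q]."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def sampled_kl(p, q, num_samples, rng):
    """Monte Carlo estimate: average of log P(y) - log Q(y) for y ~ P."""
    draws = rng.choices(range(len(p)), weights=p, k=num_samples)
    return sum(math.log(p[i]) - math.log(q[i]) for i in draws) / num_samples

# Hypothetical toy policies over three possible outputs.
p_rl = [0.5, 0.3, 0.2]
p_sft = [0.4, 0.4, 0.2]

kl_exact = exact_kl(p_rl, p_sft)
decomposed = cross_entropy(p_rl, p_sft) - entropy(p_rl)
kl_mc = sampled_kl(p_rl, p_sft, 20_000, random.Random(0))

print(f"exact KL:          {kl_exact:.5f}")
print(f"CE(P,Q) - H(P):    {decomposed:.5f}")
print(f"sampled estimate:  {kl_mc:.5f}")
```

The sampled form is the one that matters in practice, since the space of full responses is far too large to enumerate.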

Policy Gradient Update (PPO)

PPO (Proximal Policy Optimization) updates the policy to minimize the loss.

For a single rollout from prompt $x$:

$$y \sim \pi_{\text{RL}}(\cdot | x) \text{ (generate response)}$$

$$r = r_\theta(x, y) \text{ (get reward)}$$

$$\text{advantage} = r - V(x) \text{ (subtract baseline to reduce variance)}$$

Where $V(x)$ is a learned baseline (value function).
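
A per-sample sketch of these quantities, with the KL penalty from the full objective folded into the reward before the baseline is subtracted. The function names and the numeric inputs are hypothetical. The default `beta=0.02` is the value this section attributes to the paper.

```python
def kl_shaped_reward(rm_score, logp_rl, logp_sft, beta=0.02):
    """Reward model score minus the per-sample KL penalty.

    logp_rl and logp_sft are the log-probabilities of the sampled
    response under the RL policy and the frozen SFT policy.
    """
    return rm_score - beta * (logp_rl - logp_sft)

def advantage(reward, value_estimate):
    """Subtract the learned baseline V(x) to reduce variance."""
    return reward - value_estimate

# Hypothetical numbers for one rollout.
shaped = kl_shaped_reward(rm_score=3.0, logp_rl=-12.0, logp_sft=-12.5)
adv = advantage(shaped, value_estimate=2.5)
print(f"shaped reward: {shaped:.3f}, advantage: {adv:.3f}")
```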

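The policy gradient is (writing $\phi$ for the policy parameters, to distinguish them from the reward model's $\theta$):

$$\nabla_\phi L_{\text{RL}} \approx -\left[ r - \beta \left( \log \pi_{\text{RL}}(y|x) - \log \pi_{\text{SFT}}(y|x) \right) - V(x) \right] \nabla_\phi \log \pi_{\text{RL}}(y|x)$$

The bracket is the advantage with the per-sample KL penalty folded into the reward. (This is simplified; PPO additionally clips the probability ratio between the new and old policy for stability.)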
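
For reference, PPO's stabilizing step is usually written as a clipped surrogate objective on the probability ratio between the updated policy and the policy that generated the sample. The sketch below shows that objective for a single sample. The function name and the clip range `clip_eps=0.2` are assumptions (0.2 is a common default, not a value taken from this section).

```python
import math

def ppo_clipped_loss(logp_new, logp_old, adv, clip_eps=0.2):
    """Negative clipped surrogate objective for one sampled response."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    # Taking the minimum removes any incentive to move the ratio
    # outside [1 - eps, 1 + eps] in the direction the advantage favors.
    return -min(ratio * adv, clipped_ratio * adv)

# Ratio of 1.5 with a positive advantage: the gain is capped at 1.2 * adv.
capped = ppo_clipped_loss(math.log(1.5), 0.0, adv=1.0)
# Same ratio with a negative advantage: the full penalty is kept.
uncapped = ppo_clipped_loss(math.log(1.5), 0.0, adv=-1.0)
print(capped, uncapped)
```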

Hyperparameter: β (KL Coefficient)

The coefficient $\beta$ controls the trade-off:

β = 0: Pure reward-seeking (RL ignores SFT baseline). Model might exploit RM flaws.
β large (e.g., 1.0): Strong KL penalty. Model stays very close to SFT, limited improvement.
β ≈ 0.01–0.1: Sweet spot. Improvement from reward + regularization from KL.

In the paper: β ≈ 0.02 works well empirically.
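
The trade-off can be illustrated with a toy comparison of three candidate policies, each summarized by a reward and a KL distance from the SFT model. All names and numbers below are hypothetical, chosen so that each regime of $\beta$ picks a different winner.

```python
# Hypothetical (reward, KL from SFT) summaries for three candidate policies.
candidates = {
    "reward_hacker": (5.0, 150.0),  # high RM score, far from SFT
    "balanced": (3.0, 5.0),         # moderate gain, modest drift
    "sft_copy": (1.5, 0.0),         # identical to the SFT model
}

def objective(name, beta):
    """Reward minus beta times KL for the named candidate."""
    reward, kl = candidates[name]
    return reward - beta * kl

def best_policy(beta):
    """Candidate that maximizes the KL-penalized objective."""
    return max(candidates, key=lambda name: objective(name, beta))

for beta in (0.0, 0.02, 1.0):
    print(f"beta={beta}: {best_policy(beta)}")
```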

Practical Concern: Reward Shaping

In practice, RL might optimize a simple reward while ignoring other important aspects. For example:

The model might write longer responses (if longer responses tend to be rated better)
Or use confusing jargon (if that’s in the training data)

Solution: Include multiple reward signals or penalize length:

$$L_{\text{RL}} = -r_\theta(x, y) + \beta \cdot KL[\pi_{\text{RL}} || \pi_{\text{SFT}}] + \alpha \cdot \log |y|$$

Where $|y|$ is the response length (for example, in tokens). The $\alpha$ term penalizes excessively long responses.
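
A small sketch of the length-penalized loss for one sample. The function name and the default `alpha=0.1` are hypothetical; the text does not give a value for $\alpha$.

```python
import math

def length_penalized_loss(reward, kl, num_tokens, beta=0.02, alpha=0.1):
    """-reward + beta * KL + alpha * log(response length)."""
    if num_tokens < 1:
        raise ValueError("num_tokens must be at least 1")
    return -reward + beta * kl + alpha * math.log(num_tokens)

# Same reward and KL, different lengths: the longer response scores worse.
short = length_penalized_loss(reward=3.0, kl=0.5, num_tokens=20)
long_ = length_penalized_loss(reward=3.0, kl=0.5, num_tokens=200)
print(f"20 tokens: {short:.4f}, 200 tokens: {long_:.4f}")
```

Because the penalty grows with the logarithm of the length, a tenfold increase in length costs the same amount ($\alpha \log 10$) at any starting length.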

Worked Example: Reward Model Loss

Scenario: Training the reward model on a single comparison.

Prompt: "What is 2+2?"

Output A (y_w): "2 + 2 = 4"
Output B (y_l): "2 + 2 = 5"

Human rater: A is better

Suppose the reward model outputs:

$r_\theta(x, y_w) = 2.5$ (logit for correct answer)
$r_\theta(x, y_l) = -1.2$ (logit for wrong answer)

Step 1: Compute the difference $$z = r_\theta(x, y_w) - r_\theta(x, y_l) = 2.5 - (-1.2) = 3.7$$

Step 2: Apply sigmoid $$\sigma(z) = \sigma(3.7) = \frac{1}{1 + e^{-3.7}} = \frac{1}{1 + 0.0247} = \frac{1}{1.0247} \approx 0.9759$$

Step 3: Compute loss $$L = -\log \sigma(z) = -\log(0.9759) \approx 0.0244$$

Interpretation: The loss is small (~0.024) because the model confidently (0.98 probability) predicted the correct preference.

Contrast: If the model predicted wrong

If $r_\theta(x, y_w) = 0.1$ and $r_\theta(x, y_l) = 0.5$:

$$z = 0.1 - 0.5 = -0.4$$

$$\sigma(-0.4) = \frac{1}{1 + e^{0.4}} \approx 0.4013$$

$$L = -\log(0.4013) \approx 0.913$$

The loss is much higher because the model ranked the pair incorrectly, assigning only about 40% probability to the human's preference.
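
The arithmetic in both cases can be checked with a few lines of Python. The helper names are hypothetical; the scores are the ones used in the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def rm_loss(r_w, r_l):
    """-log sigma(r_w - r_l) for one comparison."""
    return -math.log(sigmoid(r_w - r_l))

# Case 1: the model scores the preferred answer higher.
p_correct = sigmoid(2.5 - (-1.2))
loss_correct = rm_loss(2.5, -1.2)

# Case 2: the model scores the preferred answer lower.
p_wrong = sigmoid(0.1 - 0.5)
loss_wrong = rm_loss(0.1, 0.5)

print(f"case 1: p={p_correct:.4f}, loss={loss_correct:.4f}")
print(f"case 2: p={p_wrong:.4f}, loss={loss_wrong:.4f}")
```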

Worked Example: KL Divergence Penalty

Scenario: After RL training, we compute the KL penalty for one prompt.

Prompt: "Explain photosynthesis."

π_SFT distribution over first token:
  P(SFT)("Plants") = 0.6
  P(SFT)("Photosynthesis") = 0.2
  P(SFT)("A") = 0.1
  P(SFT)("The") = 0.1

π_RL distribution over first token (after RL):
  P(RL)("Plants") = 0.7
  P(RL)("Photosynthesis") = 0.1
  P(RL)("A") = 0.1
  P(RL)("The") = 0.1

Compute KL divergence (using natural log):

$$KL = \sum P_{\text{RL}}(w) \log \frac{P_{\text{RL}}(w)}{P_{\text{SFT}}(w)}$$

For each token:

“Plants”: $P_{\text{RL}} = 0.7, P_{\text{SFT}} = 0.6$
- $0.7 \log(0.7/0.6) = 0.7 \times 0.1541 = 0.1079$
“Photosynthesis”: $P_{\text{RL}} = 0.1, P_{\text{SFT}} = 0.2$
- $0.1 \log(0.1/0.2) = 0.1 \times (-0.6931) = -0.0693$
“A”: $P_{\text{RL}} = 0.1, P_{\text{SFT}} = 0.1$
- $0.1 \log(1.0) = 0$
“The”: $P_{\text{RL}} = 0.1, P_{\text{SFT}} = 0.1$
- $0.1 \log(1.0) = 0$

Total: $$KL = 0.1079 - 0.0693 = 0.0386 \text{ nats}$$

In the loss: $$\text{KL penalty} = \beta \times KL = 0.02 \times 0.0386 = 0.000772$$

If the prompt gets reward $r = 3.0$:

$$L_{\text{RL}} = -3.0 + 0.000772 \approx -2.999$$

The KL penalty is small relative to reward, but it accumulates over the sequence and prevents divergence.
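
The same calculation as a short script, using the first-token distributions from the example. The variable names are hypothetical.

```python
import math

p_sft = {"Plants": 0.6, "Photosynthesis": 0.2, "A": 0.1, "The": 0.1}
p_rl = {"Plants": 0.7, "Photosynthesis": 0.1, "A": 0.1, "The": 0.1}

# KL[pi_RL || pi_SFT] over the first token, in nats.
kl = sum(p_rl[w] * math.log(p_rl[w] / p_sft[w]) for w in p_rl)

beta = 0.02
reward = 3.0
penalty = beta * kl
loss = -reward + penalty

print(f"KL = {kl:.4f} nats")
print(f"penalty = {penalty:.6f}")
print(f"loss = {loss:.4f}")
```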

Summary: Mathematical Components

| Stage | Loss | Key Components |
|---|---|---|
| SFT | $-\log \pi_{\text{SFT}}(y\|x)$ | Cross-entropy on human responses |
| RM | $-\log \sigma(r_\theta(x,y_w) - r_\theta(x,y_l))$ | Bradley-Terry classification |
| RL | $-r_\theta + \beta \cdot KL[\pi_{\text{RL}} \|\| \pi_{\text{SFT}}]$ | Reward + KL constraint |

All three use standard ML techniques, but combined strategically for alignment.

Practical Notes

Scaling: Each stage is relatively cheap to train:
- SFT: ~1 GPU-hour
- RM: ~0.5 GPU-hour
- RL: ~10 GPU-hours (most expensive due to generation and reward computation)
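Instability: RL can be unstable (gradients explode). PPO mitigates this by clipping the probability ratio between the new and old policy, which limits how far a single update can move the policy.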
Data efficiency: Compared to pure RL from human ratings, this three-stage approach is ~100× more sample-efficient because the RM learns to generalize.
Metric: Human raters prefer InstructGPT (1.3B) over GPT-3 (175B) 72% of the time. This is the paper’s headline result.