Worked Example: Training the Reward Model
Let’s trace one iteration of reward model training from start to finish, using concrete, illustrative numbers.
Setting: Collecting Human Comparisons
We have a prompt and two model-generated responses. A human rater has judged which is better.
Prompt (x):
"Explain why the sky is blue to a 5-year-old."
Output A (y_A) — from a less-trained model:
"The sky is blue because of Rayleigh scattering. Shorter wavelengths
of light scatter more, and blue has a shorter wavelength than red."
Output B (y_B) — from a better model:
"Have you ever played with a prism or looked through a blue marble?
When sunlight goes through the air, something special happens! The
tiny bits of air bounce the blue light more than the red light, so
we see more blue. That's why the sky is blue!"
Human rater judgment: B is better (more age-appropriate, clearer analogy)
Training the Reward Model: Step-by-Step
Step 1: Forward Pass Through the Model
The RM is a neural network. It takes (prompt, response) and outputs a logit (unnormalized score).
Let’s say the RM gives:
r_θ(x, y_A) = 0.8 (not great, uses too much jargon)
r_θ(x, y_B) = 2.1 (better, age-appropriate)
These are logits — not probabilities yet.
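As a rough sketch of where such a scalar could come from, the snippet below assumes the common design of a language-model backbone followed by a linear "reward head". The walkthrough does not specify the architecture, so the function `reward_head`, the 3-dimensional hidden states, and the weights are all hypothetical, chosen only so the outputs match the scores above.

```python
from typing import Sequence


def reward_head(hidden: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Map a final hidden state to one unbounded scalar score (a logit)."""
    if len(hidden) != len(weights):
        raise ValueError("hidden and weights must have the same length")
    return sum(h * w for h, w in zip(hidden, weights)) + bias


# Hypothetical hidden states for (x, y_A) and (x, y_B). Real models use
# thousands of dimensions; these values are made up for illustration.
weights = [1.0, -0.5, 0.5]
bias = 0.1
hidden_a = [0.5, 0.2, 0.6]
hidden_b = [1.5, 0.0, 1.0]

r_a = reward_head(hidden_a, weights, bias)
r_b = reward_head(hidden_b, weights, bias)
print(f"r_A = {r_a:.1f}, r_B = {r_b:.1f}")
```

The point is only that the output is a single unbounded number per (prompt, response) pair, with no softmax or sigmoid applied yet.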
Step 2: Compute the Difference
In Bradley-Terry, we care about the relative ranking:
$$z = r_\theta(x, y_B) - r_\theta(x, y_A) = 2.1 - 0.8 = 1.3$$
A positive $z$ means the model predicted B is better, which matches the human judgment.
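For readers who want the reasoning behind using the difference, here is the standard Bradley-Terry form (a textbook identity, not something specific to this walkthrough). Each response gets a positive "strength" $e^{r}$, and the preference probability is the ratio of strengths:

$$P(y_B \succ y_A \mid x) = \frac{e^{r_\theta(x, y_B)}}{e^{r_\theta(x, y_A)} + e^{r_\theta(x, y_B)}} = \frac{1}{1 + e^{-(r_\theta(x, y_B) - r_\theta(x, y_A))}} = \frac{1}{1 + e^{-z}}$$

Dividing numerator and denominator by $e^{r_\theta(x, y_B)}$ gives the middle step. Adding the same constant to both scores leaves $z$, and therefore the probability, unchanged.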
Step 3: Apply Sigmoid to Get Probability
The sigmoid function maps the difference to a probability:
$$P(B \text{ preferred}) = \sigma(z) = \frac{1}{1 + e^{-z}} = \frac{1}{1 + e^{-1.3}}$$
Calculate: $$e^{-1.3} \approx 0.2725$$
$$\sigma(1.3) = \frac{1}{1 + 0.2725} = \frac{1}{1.2725} \approx 0.7858$$
Interpretation: The model assigns about 78.6% probability to B being better. Since the human rater also chose B, this is good.
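Steps 2 and 3 can be reproduced in a few lines. This is a minimal sketch using only the standard library; the helper name `sigmoid` and the branch that avoids overflow for very negative inputs are choices made here, not part of the original walkthrough.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function, arranged so exp() never sees a large positive value."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


r_a, r_b = 0.8, 2.1          # scores from Step 1
z = r_b - r_a                # Step 2: score difference
p_b_preferred = sigmoid(z)   # Step 3: Bradley-Terry probability

print(f"z = {z:.1f}, P(B preferred) = {p_b_preferred:.4f}")
```

Note that `sigmoid(-z)` is the probability that A is preferred, so the two probabilities always sum to 1.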
Step 4: Compute Cross-Entropy Loss
Since B was judged better, the loss is:
$$L = -\log P(\text{B preferred}) = -\log \sigma(1.3) \approx 0.2410$$
Interpretation: The loss is moderate. The model got it right, but wasn’t fully confident. If the model were more confident (say, 0.95), the loss would be -log(0.95) ≈ 0.051 (lower). If the model leaned the wrong way (say, 0.3), the loss would be -log(0.3) ≈ 1.204 (much higher). Here log is the natural logarithm.
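The loss can also be computed directly from the two scores. The sketch below uses the identity $-\log \sigma(z) = \log(1 + e^{-z})$, which avoids taking the log of a tiny probability; the function name `preference_loss` and its argument names are illustrative.

```python
import math


def preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Negative log-likelihood -log(sigmoid(r_chosen - r_rejected)).

    Uses -log(sigmoid(z)) = log(1 + exp(-z)), arranged so that exp()
    never receives a large positive argument.
    """
    z = r_chosen - r_rejected
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# B was chosen by the human, so its score goes in the "chosen" slot.
loss = preference_loss(r_chosen=2.1, r_rejected=0.8)
print(f"loss = {loss:.4f}")

# Loss as a function of the probability assigned to the human's choice.
for p in (0.95, 0.5, 0.3):
    print(f"p = {p:.2f} -> loss = {-math.log(p):.3f}")
```

Passing the human's choice as `r_chosen` means the same function works whichever response was preferred.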
Step 5: Backpropagation (Implicit)
In training, we compute gradients and update the model weights:
$$\theta \leftarrow \theta - \eta \nabla_\theta L$$
Where $\eta$ is the learning rate (e.g., 1e-5).
The update moves $\theta$ against the gradient, which increases $P(\text{B preferred})$: it pushes $r_\theta(x, y_B)$ up and/or $r_\theta(x, y_A)$ down.
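For this pairwise loss, the gradient with respect to the two scores has a closed form: $\partial L / \partial r_B = -(1 - \sigma(z))$ and $\partial L / \partial r_A = +(1 - \sigma(z))$. The sketch below checks that against finite differences and then takes one descent step directly on the scores. That last part is a deliberate simplification: real training updates $\theta$ and the change reaches the scores through the network. The learning rate of 0.1 is picked so the effect is visible, not as a recommended value.

```python
import math
from typing import Tuple


def loss(r_chosen: float, r_rejected: float) -> float:
    """-log(sigmoid(r_chosen - r_rejected)); fine for moderate score gaps."""
    return math.log1p(math.exp(-(r_chosen - r_rejected)))


def score_gradients(r_chosen: float, r_rejected: float) -> Tuple[float, float]:
    """Analytic (dL/dr_chosen, dL/dr_rejected)."""
    p = 1.0 / (1.0 + math.exp(-(r_chosen - r_rejected)))
    return -(1.0 - p), (1.0 - p)


r_b, r_a = 2.1, 0.8                     # B is the human-preferred response
g_b, g_a = score_gradients(r_b, r_a)

# Central finite-difference check of the analytic gradient.
eps = 1e-6
fd_b = (loss(r_b + eps, r_a) - loss(r_b - eps, r_a)) / (2 * eps)
fd_a = (loss(r_b, r_a + eps) - loss(r_b, r_a - eps)) / (2 * eps)

# One descent step on the scores themselves (illustrative learning rate).
lr = 0.1
new_r_b, new_r_a = r_b - lr * g_b, r_a - lr * g_a

print(f"dL/dr_B = {g_b:.4f} (finite diff {fd_b:.4f})")
print(f"dL/dr_A = {g_a:.4f} (finite diff {fd_a:.4f})")
print(f"loss before = {loss(r_b, r_a):.4f}, after = {loss(new_r_b, new_r_a):.4f}")
```

The two gradients are equal and opposite, which is why the update widens the gap between the preferred and rejected scores rather than moving both in the same direction.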
Full Batch: Multiple Comparisons
In practice, we train on batches of comparison pairs. Let’s see three examples in one batch:
Example 1 (from above)
Prompt: "Explain why the sky is blue to a 5-year-old."
y_A: Too technical answer
y_B: Age-appropriate answer
Human: B is better
Reward Model output:
r_A = 0.8, r_B = 2.1
z = 1.3, σ(z) ≈ 0.7858
L₁ = -log σ(1.3) ≈ 0.2410
Example 2
Prompt: "What is the capital of France?"
y_A: "Paris. It is the largest city in France."
y_B: "Paris"
Human: A is better (more helpful context)
Reward Model output:
r_A = 1.8, r_B = 1.2
z = r_B - r_A = 1.2 - 1.8 = -0.6
P(B preferred) = σ(-0.6) = 1/(1 + e^0.6) = 1/2.8221 ≈ 0.3543
P(A preferred) = 1 - 0.3543 = σ(0.6) ≈ 0.6457
L₂ = -log σ(0.6) ≈ 0.4375
Because the human chose A, the loss uses P(A preferred). The model ranks A above B, which matches the human, but the score gap is small, so this is the highest loss in the batch.
Example 3
Prompt: "Tell me a fun fact about penguins."
y_A: "Penguins can swim up to 22 mph and dive 500+ meters deep."
y_B: "Did you know penguins can hold their breath for 6 minutes? They are
amazing swimmers and can dive deeper than most people!"
Human: B is better (more engaging, better pacing)
Reward Model output:
r_A = 1.5, r_B = 2.5
z = 2.5 - 1.5 = 1.0
σ(1.0) = 1/(1 + e^-1) = 1/1.3679 ≈ 0.7311
L₃ = -log σ(1.0) ≈ 0.3133
Batch Loss
Average loss over the three examples:
$$L_{\text{batch}} = \frac{1}{3}(L_1 + L_2 + L_3) = \frac{1}{3}(0.2410 + 0.4375 + 0.3133) = \frac{0.9918}{3} \approx 0.3306$$
Interpreting the Loss Values
- L₁ = 0.2410: Model got it right (78.6% confidence). Good.
- L₂ = 0.4375: Model got it right, but with the least confidence (64.6%). Training will focus on this one more.
- L₃ = 0.3133: Model got it right but not very confident (73.1%). Could be better.
The gradient with respect to the score difference has magnitude 1 - P(human's choice), so it is largest for Example 2 (0.354, versus 0.214 and 0.269). That example will exert the most influence on the weight update.
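The batch computation can be sketched as follows, using the chosen-minus-rejected convention so that the human's preferred response always goes in the first slot of the loss. The tuple layout `(r_A, r_B, label)` and the variable names are illustrative choices made here.

```python
import math


def pair_loss(r_chosen: float, r_rejected: float) -> float:
    """-log(sigmoid(r_chosen - r_rejected))."""
    return math.log1p(math.exp(-(r_chosen - r_rejected)))


# (r_A, r_B, label) for the three comparisons; label names the human's choice.
batch = [
    (0.8, 2.1, "B"),   # sky is blue
    (1.8, 1.2, "A"),   # capital of France
    (1.5, 2.5, "B"),   # penguin fun fact
]

losses = []
grad_magnitudes = []
for r_a, r_b, label in batch:
    r_chosen, r_rejected = (r_a, r_b) if label == "A" else (r_b, r_a)
    p_chosen = 1.0 / (1.0 + math.exp(-(r_chosen - r_rejected)))
    losses.append(pair_loss(r_chosen, r_rejected))
    grad_magnitudes.append(1.0 - p_chosen)   # |dL/dz| for this pair

batch_loss = sum(losses) / len(losses)
for i, (pair, grad) in enumerate(zip(losses, grad_magnitudes), start=1):
    print(f"example {i}: loss = {pair:.4f}, |dL/dz| = {grad:.4f}")
print(f"batch loss = {batch_loss:.4f}")
```

Printing the per-pair gradient magnitude next to the loss makes it easy to see which comparison pulls hardest on the weights.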
What the Model Learns
Over thousands of training examples, the reward model learns:
- To value clarity: Age-appropriate explanations get higher scores
- To value completeness: Answers with context score better than minimal answers
- To value engagement: Interactive, interesting phrasings score higher
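- To penalize wordiness: Overly long explanations score lower when raters consistently prefer the tighter answer
- To handle trade-offs: The right length depends on the prompt. One extra sentence of context won on the capital question, while a longer, more conversational answer won on the fun fact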
The model internalizes these preferences via the loss gradients.
Why Bradley-Terry Instead of Regression?
An alternative would be regression: ask humans to rate A and B separately on a 1-5 scale, then train the RM to predict those ratings.
Why Bradley-Terry (relative ranking) is better:
- Easier annotation: Comparing two things is easier than assigning absolute ratings. Humans are better at “A vs. B” than “What number is this?”
- Less ambiguous: A 3-star response is ambiguous. Does it mean “good” or “mediocre”? With comparisons, the judgment is clear.
- More data-efficient: You can generate many (y_A, y_B) pairs from fewer unique responses by comparing different pairs.
- Natural human judgment: Humans naturally compare. Absolute ratings require calibration across raters.
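To make the data-efficiency point concrete: K responses to one prompt yield K(K-1)/2 unordered pairs. The sketch below assumes a rater has ranked the responses best-first, which is one possible annotation setup and not necessarily the one used in the paper; the response names are placeholders.

```python
from itertools import combinations

# Hypothetical setup: K sampled responses to one prompt, ranked best-first
# by a single rater.
ranked_responses = ["resp_1", "resp_2", "resp_3", "resp_4"]

# Every unordered pair becomes one (chosen, rejected) training example.
# combinations() preserves input order, so the earlier item is the better one.
pairs = list(combinations(ranked_responses, 2))

k = len(ranked_responses)
print(f"{k} responses -> {len(pairs)} comparison pairs")
for chosen, rejected in pairs:
    print(f"  {chosen} > {rejected}")
```

Pairs built from the same ranking are correlated, so they carry less independent information than the same number of pairs drawn from different prompts.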
In the paper, the RM agrees with the human raters on ~73% of the comparisons in a held-out test set. Not perfect, but good enough to guide RL.
After Training: Scoring New Responses
Once trained, the RM can score any (prompt, response) pair, even ones it never saw during training.
New Prompt: "How do birds fly?"
New Response (from the RL policy):
"Birds have special bones that are hollow and light. Their wings have
feathers arranged to push air down, creating lift. Flying is controlled
by powerful flight muscles. Pretty cool!"
RM scores it:
r_θ(x, y_new) = 1.7
The number is only meaningful relative to other scores: it might sit below a near-perfect response
(say r = 2.5) but above a poor response (say r = -0.5).
This score becomes the reward signal for the RL stage.
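As a small illustration of how these scores are read, the sketch below ranks the three scores quoted above and converts one pair into a Bradley-Terry preference probability. The response labels are hypothetical, and the final lines check that shifting every score by the same constant leaves the probability unchanged.

```python
import math

# Scores quoted above; the response labels are hypothetical.
scores = {"strong response": 2.5, "policy response": 1.7, "poor response": -0.5}


def prefer_prob(r_first: float, r_second: float) -> float:
    """Bradley-Terry probability that the first response is preferred."""
    return 1.0 / (1.0 + math.exp(-(r_first - r_second)))


ranking = sorted(scores, key=scores.get, reverse=True)
print("ranking:", ranking)

p = prefer_prob(scores["policy response"], scores["poor response"])
print(f"P(policy response preferred over poor response) = {p:.3f}")

# Shifting every score by the same constant changes no preference probability.
shifted = {name: s + 10.0 for name, s in scores.items()}
p_shifted = prefer_prob(shifted["policy response"], shifted["poor response"])
print(f"same probability after shifting all scores by 10: {p_shifted:.3f}")
```

This is why a raw reward such as 1.7 should be read against other responses to the same prompt and not as an absolute grade.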
Key Takeaway
The RM training is:
- Simple: Binary classification on preferences
- Efficient: Trains fast, requires ~0.5 GPU hours
- Generalizable: Learns patterns that transfer to new (prompt, response) pairs
- Scalable: Can rank any number of responses without retraining
This is the bridge between human feedback (which is expensive) and RL (which can leverage a learned reward function).