Worked Example: Training the Reward Model

Let’s trace through one iteration of reward model training from start to finish, using concrete, illustrative numbers.

Setting: Collecting Human Comparisons

We have a prompt and two model-generated responses. A human rater has judged which is better.

Prompt (x):
  "Explain why the sky is blue to a 5-year-old."

Output A (y_A) — from a less-trained model:
  "The sky is blue because of Rayleigh scattering. Shorter wavelengths
   of light scatter more, and blue has a shorter wavelength than red."

Output B (y_B) — from a better model:
  "Have you ever played with a prism or looked through a blue marble?
   When sunlight goes through the air, something special happens! The
   tiny bits of air bounce the blue light more than the red light, so
   we see more blue. That's why the sky is blue!"

Human rater judgment: B is better (more age-appropriate, clearer analogy)

Training the Reward Model: Step-by-Step

Step 1: Forward Pass Through the Model

The RM is a neural network. It takes a (prompt, response) pair and outputs a single scalar logit (an unnormalized score).

Let’s say the RM gives:

r_θ(x, y_A) = 0.8   (not great, uses too much jargon)
r_θ(x, y_B) = 2.1   (better, age-appropriate)

These are logits — not probabilities yet.

Step 2: Compute the Difference

In the Bradley-Terry model, only the difference between the two scores matters, not their absolute values:

$$z = r_\theta(x, y_B) - r_\theta(x, y_A) = 2.1 - 0.8 = 1.3$$

A positive $z$ means the model predicted B is better, which matches the human judgment.

Step 3: Apply Sigmoid to Get Probability

The sigmoid function maps the difference to a probability:

$$P(B \text{ preferred}) = \sigma(z) = \frac{1}{1 + e^{-z}} = \frac{1}{1 + e^{-1.3}}$$

Calculate: $$e^{-1.3} \approx 0.2725$$

$$\sigma(1.3) = \frac{1}{1 + 0.2725} = \frac{1}{1.2725} \approx 0.7858$$

Interpretation: The model assigns a probability of about 78.6% to B being better. Since the human rater also chose B, the prediction agrees with the label.
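Steps 2 and 3 can be reproduced in a few lines of Python. This is a minimal sketch using only the standard library; the function name `preference_probability` is an illustrative choice and is not taken from any particular codebase.

```python
import math

def preference_probability(r_first: float, r_second: float) -> float:
    """Bradley-Terry probability that the first response beats the second."""
    z = r_first - r_second               # Step 2: score difference
    return 1.0 / (1.0 + math.exp(-z))    # Step 3: sigmoid of the difference

# Scores from the worked example: r(x, y_B) = 2.1, r(x, y_A) = 0.8
p_b = preference_probability(2.1, 0.8)
p_a = preference_probability(0.8, 2.1)
print(f"P(B preferred) = {p_b:.4f}")
print(f"P(A preferred) = {p_a:.4f}")
```

The two probabilities sum to 1, which is why a single score difference is enough to describe the model's belief about the pair.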

Step 4: Compute Cross-Entropy Loss

Since B was judged better, the loss is the negative log of the probability the model assigned to B (natural log):

$$L = -\log P(\text{B preferred}) = -\log(0.7858) \approx 0.2410$$

Interpretation: The loss is moderate. The model ranked the pair correctly but was not fully confident. Had it assigned B a probability of 0.95, the loss would be -log(0.95) ≈ 0.051 (lower). Had it ranked the pair wrongly and assigned B only 0.3, the loss would be -log(0.3) ≈ 1.204 (much higher).
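The loss can also be computed directly from the two scores without forming the probability first. The sketch below uses the identity -log σ(z) = log(1 + e^(-z)) and rearranges it so that large score gaps do not overflow; `pairwise_loss` is a hypothetical helper name.

```python
import math

def pairwise_loss(r_chosen: float, r_rejected: float) -> float:
    """-log(sigmoid(r_chosen - r_rejected)), evaluated in a stable form."""
    z = r_chosen - r_rejected
    # log(1 + exp(-z)) == max(-z, 0) + log1p(exp(-|z|)); exp() never sees
    # a large positive argument, so extreme gaps stay finite.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

loss_example = pairwise_loss(2.1, 0.8)     # B chosen, A rejected
loss_tie = pairwise_loss(1.0, 1.0)         # equal scores: log(2)
loss_extreme = pairwise_loss(-500.0, 500.0)  # badly mis-ranked pair
print(f"{loss_example:.4f} {loss_tie:.4f} {loss_extreme:.1f}")
```

When the two scores are equal the model is at chance, and the loss equals log 2 ≈ 0.693, a useful reference point when reading training curves.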

Step 5: Backpropagation (Implicit)

In training, we compute gradients and update the model weights:

$$\theta \leftarrow \theta - \eta \nabla_\theta L$$

Where $\eta$ is the learning rate (e.g., 1e-5).

Because the update moves against the gradient, it increases $P(\text{B preferred})$: it pushes $r_\theta(x, y_B)$ higher and/or $r_\theta(x, y_A)$ lower.
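For this single pair, the gradient can be written out by hand. The following is a sketch that differentiates the loss with respect to the two scores only; in the actual model the chain rule continues from each score into $\theta$.

$$\frac{\partial L}{\partial z} = \frac{\partial}{\partial z}\bigl[-\log \sigma(z)\bigr] = -(1 - \sigma(z)) = \sigma(z) - 1$$

$$\frac{\partial L}{\partial r_\theta(x, y_B)} = \sigma(z) - 1 \approx -0.214, \qquad \frac{\partial L}{\partial r_\theta(x, y_A)} = 1 - \sigma(z) \approx +0.214 \quad (z = 1.3)$$

In score space, a gradient-descent step therefore raises the score of B and lowers the score of A, and the size of the push shrinks toward zero as $\sigma(z)$ approaches 1.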

Full Batch: Multiple Comparisons

In practice, we train on batches of comparison pairs. Let’s see three examples in one batch:

Example 1 (from above)

Prompt: "Explain why the sky is blue to a 5-year-old."
y_A: Too technical answer
y_B: Age-appropriate answer
Human: B is better

Reward Model output:
  r_A = 0.8, r_B = 2.1
  z = 2.1 - 0.8 = 1.3, σ(z) ≈ 0.7858
  L₁ = -log(0.7858) ≈ 0.2410

Example 2

Prompt: "What is the capital of France?"
y_A: "Paris. It is the largest city in France."
y_B: "Paris"
Human: A is better (more helpful context)

Reward Model output:
  r_A = 1.2, r_B = 1.8
  z = r_A - r_B = 1.2 - 1.8 = -0.6   (preferred minus rejected; A is the preferred response)
  P(A preferred) = σ(-0.6) = 1/(1 + e^0.6) = 1/2.8221 ≈ 0.3543
  L₂ = -log(0.3543) ≈ 1.0375

Higher loss because the model got the ranking backwards: it scored B above A, while the human preferred A.

Example 3

Prompt: "Tell me a fun fact about penguins."
y_A: "Penguins can swim up to 22 mph and dive 500+ meters deep."
y_B: "Did you know penguins can hold their breath for 6 minutes? They are
      amazing swimmers and can dive deeper than most people!"
Human: B is better (more engaging, better pacing)

Reward Model output:
  r_A = 1.5, r_B = 2.5
  z = 2.5 - 1.5 = 1.0
  σ(1.0) = 1/(1 + e^-1) = 1/1.3679 ≈ 0.7311
  L₃ = -log(0.7311) ≈ 0.3133

Batch Loss

Average loss over the three examples:

$$L_{\text{batch}} = \frac{1}{3}(L_1 + L_2 + L_3) = \frac{1}{3}(0.2410 + 1.0375 + 0.3133) = \frac{1.5918}{3} \approx 0.5306$$

Interpreting the Loss Values

L₁ = 0.2410: Model ranked the pair correctly (78.6% on the preferred response). Good.
L₂ = 1.0375: Model ranked the pair wrongly (only 35.4% on the preferred response). Training will focus on this one more.
L₃ = 0.3133: Model ranked the pair correctly but with less confidence (73.1%). Could be better.

The gradient of each loss with respect to its score difference has magnitude 1 - σ(z): about 0.21, 0.65, and 0.27 for the three examples. It is largest for L₂ (the wrong prediction), so this example exerts the most influence on the weight update.
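The batch computation can be sketched as follows. Each tuple holds the score of the human-preferred response first and the rejected response second, so Example 2 appears as the mis-ranked pair with z = -0.6. The variable names are illustrative, and the gradient magnitude shown is taken with respect to the score difference only, not the network weights.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

# (score of preferred response, score of rejected response)
batch = [
    (2.1, 0.8),  # Example 1: ranked correctly, z = 1.3
    (1.2, 1.8),  # Example 2: ranked wrongly,  z = -0.6
    (2.5, 1.5),  # Example 3: ranked correctly, z = 1.0
]

# Per-pair loss: -log sigmoid(preferred - rejected)
losses = [-math.log(sigmoid(rc - rr)) for rc, rr in batch]
# |dL/dz| = 1 - sigmoid(z): how hard each pair pushes on its scores
grad_magnitudes = [1.0 - sigmoid(rc - rr) for rc, rr in batch]
batch_loss = sum(losses) / len(losses)

for i, (loss, g) in enumerate(zip(losses, grad_magnitudes), start=1):
    print(f"example {i}: loss={loss:.4f}  |dL/dz|={g:.4f}")
print(f"batch loss: {batch_loss:.4f}")
```

The mis-ranked pair has both the largest loss and the largest gradient magnitude, which is the sense in which training "focuses" on it.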

What the Model Learns

Over thousands of training examples, the reward model learns:

To value clarity: Age-appropriate explanations get higher scores
To value completeness: Answers with context score better than minimal answers
To value engagement: Interactive, interesting phrasings score higher
To penalize wordiness: Overly long explanations score lower
To handle trade-offs: Extra length helps when it adds useful context (capital question) or engagement (fun fact), but hurts when it is only padding

The model internalizes these preferences via the loss gradients.

Why Bradley-Terry Instead of Regression?

An alternative would be regression: ask humans to rate A and B separately on a 1-5 scale, then train the RM to predict those ratings.

Why Bradley-Terry (relative ranking) is better:

Easier annotation: Comparing two things is easier than assigning absolute ratings. Humans are better at “A vs. B” than “What number is this?”
Less ambiguous: A 3-star response is ambiguous. Does it mean “good” or “mediocre”? With comparisons, the judgment is clear.
More data-efficient: A set of K responses to the same prompt yields K(K-1)/2 comparison pairs, so many (y_A, y_B) pairs come from few unique responses.
Natural human judgment: Humans naturally compare. Absolute ratings require calibration across raters.
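The data-efficiency point is easy to check by enumeration. The sketch below assumes a rater has ranked K = 4 responses to one prompt; the response labels are placeholders.

```python
from itertools import combinations

# Hypothetical: four responses to the same prompt, already ranked by a rater
responses = ["y_1", "y_2", "y_3", "y_4"]

# Every unordered pair becomes one comparison for the reward model
pairs = list(combinations(responses, 2))
k = len(responses)
print(len(pairs), k * (k - 1) // 2)
for first, second in pairs:
    print(first, "vs", second)
```

Pairs drawn from the same ranking are correlated, so they are not worth as much as the same number of independent comparisons; how to batch them is a design choice this sketch leaves open.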

In the paper, the RM agrees with the human raters on ~73% of comparisons in a held-out test set. Not perfect, but good enough to guide RL.

After Training: Scoring New Responses

Once trained, the RM can score any (prompt, response) pair, even ones it never saw during training.

New Prompt: "How do birds fly?"
New Response (from the RL policy):
  "Birds have special bones that are hollow and light. Their wings have
   feathers arranged to push air down, creating lift. Flying is controlled
   by powerful flight muscles. Pretty cool!"

RM scores it:
  r_θ(x, y_new) = 1.7

Only relative scores are meaningful here: 1.7 might sit below a near-perfect response
(say, r = 2.5) but above a poor one (say, r = -0.5).

This score becomes the reward signal for the RL stage.
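Because the training loss depends only on score differences, the scale's zero point is arbitrary. The sketch below uses the three illustrative scores from this section as stand-ins for real RM outputs and shows that shifting every score by a constant changes neither the ranking nor the implied preference probabilities. The dictionary keys and the shift of 10.0 are assumptions made for the example.

```python
import math

def preference_probability(r_first: float, r_second: float) -> float:
    return 1.0 / (1.0 + math.exp(-(r_first - r_second)))

# Stand-in RM scores for three responses to the same prompt
scores = {"near_perfect": 2.5, "new_response": 1.7, "poor": -0.5}
shifted = {name: r + 10.0 for name, r in scores.items()}

ranking = sorted(scores, key=scores.get, reverse=True)
ranking_shifted = sorted(shifted, key=shifted.get, reverse=True)

p = preference_probability(scores["new_response"], scores["poor"])
p_shifted = preference_probability(shifted["new_response"], shifted["poor"])
print(ranking, ranking_shifted)
print(f"{p:.4f} {p_shifted:.4f}")
```

This is one reason RL pipelines often normalize or offset RM scores before using them as rewards, although whether and how to do so is a separate design decision.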

Key Takeaway

The RM training is:

Simple: Binary classification on preferences
Efficient: Trains fast, requires ~0.5 GPU hours
Generalizable: Learns patterns that transfer to new (prompt, response) pairs
Scalable: Can rank any number of responses without retraining

This is the bridge between human feedback (which is expensive) and RL (which can leverage a learned reward function).