Radman Rakhshandehroo

Reinforcement Learning & AI

The Clever Trick at the Heart of RLHF

I've been doing RL for a while. So when I read about RLHF I expected it all to make sense: the reward signal, the policy gradient updates, the PPO loop. And it did. Except for one part I hadn't thought through: where the reward signal actually comes from.

Someone has to sit down and answer a deceptively hard question. How do you turn human judgment into a number?

The obvious answer is a rating scale. Ask annotators to score each response from 1 to 10. Clean, simple and completely broken.

Here's why. If you show someone a response and ask whether it's a 6 or a 7, they'll agonize. Different annotators calibrate differently. The same annotator drifts over sessions. You end up with noisy, inconsistent numbers as your training signal, and you're trying to build a reward model on top of them.

The insight that fixes this is simple. Humans are bad at absolute judgments but good at relative ones.

Don't ask me to rate response A. Show me response A and response B and ask which is better. That I can do. Consistently, quickly and reliably.

TRY IT
PROMPT: Explain how a neural network learns from its mistakes.
RATE THIS RESPONSE (1–10)
The network adjusts its weights using backpropagation, which applies the chain rule to compute each weight's gradient with respect to the loss, then nudges weights in the direction that reduces that loss.
WHICH IS BETTER?

Rate the response on the left, then pick a winner on the right. Notice which one you finish first.

Now you have a pile of pairwise comparisons. A beat B. C beat A. B beat D. The question is: how do you convert that tournament data into a single ranking?

The answer was already sitting in an unlikely place: Elo.

Arpad Elo invented his rating system in the 1960s to rank chess players. Every player has a scalar rating R. Before any game, you predict the outcome from the rating difference alone:

E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}}

This is a logistic sigmoid stretched across rating space. Equal ratings means E_A = 0.5. A 400-point gap means roughly a 91% win probability. The 400 is just a calibration constant that sets the scale.
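The formula is small enough to sanity-check in a few lines of Python (the function name is mine):

```python
def expected_score(r_a, r_b):
    """Elo expected score for player A: a logistic curve in the rating gap."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

expected_score(1200, 1200)  # equal ratings: 0.5
expected_score(1400, 1000)  # a 400-point edge: ~0.909
```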

TRY IT
A: 1400 vs B: 1000 → win probability 90.9% / 9.1%

Drag the sliders to set two ratings and watch the win probability update.

After the game both ratings update:

R_A' = R_A + K(S_A - E_A)

R_B' = R_B + K(S_B - E_B)

Where:

  • S_A \in \{1, 0.5, 0\} is the actual result (win, draw, loss)
  • K is a learning rate, typically 16 to 32
  • E_A is the pre-match expected score

K deserves a closer look. A small K means ratings shift slowly, damping out noise at the cost of responsiveness. A large K means faster convergence but more volatility. Chess federations use different values depending on how established a player is: new players get K = 40, top players K = 16. RLHF reward models face the same tradeoff.
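The update rule as a small Python sketch (function names are mine; note that in a two-player game S_B = 1 - S_A and E_B = 1 - E_A, so the update is zero-sum):

```python
def expected_score(r_a, r_b):
    """Elo expected score for player A."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a, r_b, s_a, k=32):
    """Update both ratings after one game. s_a is 1 (A wins), 0.5 (draw), or 0."""
    e_a = expected_score(r_a, r_b)
    # The shift is proportional to the surprise: s_a - e_a.
    r_a_new = r_a + k * (s_a - e_a)
    r_b_new = r_b + k * ((1 - s_a) - (1 - e_a))
    return r_a_new, r_b_new

elo_update(1400, 1000, 1)  # expected win: ratings barely move (~3 points)
elo_update(1400, 1000, 0)  # upset: large shift (~29 points)
```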

TRY IT

Three simulations of 40 matches starting at 1200. Higher K converges faster but bounces around more. Hit regenerate to run a different sequence.

Note: Surprise drives the update

The update magnitude is proportional to how unexpected the outcome was. Expected win: ratings barely move. Upset: large shift. The system is self-correcting.

TRY IT

The entire Elo win probability in one picture. X is R_A, Y is R_B, Z is P(A wins). Flat along the diagonal, steep at the corners. Rotate it.


Now here's the part that made me stop. Elo's rating system isn't some ad hoc invention out of thin air. It's a special case of the Bradley-Terry model.

Bradley-Terry says the probability that item i is preferred over item j is:

P(i \succ j) = \frac{e^{\beta_i}}{e^{\beta_i} + e^{\beta_j}} = \sigma(\beta_i - \beta_j)

Where \beta_i is a scalar strength score and \sigma is the logistic sigmoid. Set \beta_i = R_i \ln 10 / 400 and you recover the Elo expectation exactly. The claim is that some latent quality determines preference, and each pairwise comparison is a noisy observation of that quality difference.
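Given only a pile of (winner, loser) pairs, the \beta's can be recovered by maximum likelihood. A minimal sketch of that fit in Python (plain gradient ascent; all names are mine):

```python
import math

def fit_bradley_terry(comparisons, n_items, lr=0.1, steps=200):
    """Fit latent strengths beta from (winner, loser) index pairs
    by gradient ascent on the Bradley-Terry log-likelihood."""
    beta = [0.0] * n_items
    for _ in range(steps):
        grad = [0.0] * n_items
        for w, l in comparisons:
            # P(winner beats loser) under the current strengths.
            p = 1.0 / (1.0 + math.exp(beta[l] - beta[w]))
            grad[w] += 1.0 - p   # push the winner's strength up
            grad[l] -= 1.0 - p   # and the loser's down
        beta = [b + lr * g for b, g in zip(beta, grad)]
    return beta

# A beat B, B beat C, A beat C (indices 0, 1, 2):
beta = fit_bradley_terry([(0, 1), (1, 2), (0, 2)], 3)
# beta[0] > beta[1] > beta[2]
```

With real annotation data you'd use a proper optimizer and handle ties, but the shape of the computation is the same one the reward-model loss below induces.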

This is exactly the model behind RLHF. But there's a step here that's easy to gloss over: the human never produces a number. They just click a button. So where does the scalar actually come from?

The reward model r_\theta(x, y) outputs a single number for any prompt-response pair. When you say "A is better than B," that gives the training loop one labeled data point: response y_A beat y_B given prompt x. The loss function is designed so that minimizing it forces the model's score for y_A above its score for y_B:

\mathcal{L}(\theta) = -\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( r_\theta(x, y_w) - r_\theta(x, y_l) \right) \right]

The \sigma maps the score difference to a probability. The loss is small when r_\theta(x, y_w) - r_\theta(x, y_l) is large and positive. So gradient descent nudges the weights to push the winner's score up and the loser's score down. Your binary "A is better" label never becomes a number directly. It becomes a constraint: whatever numbers the model assigns, the winning response must score higher.
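Written out for a single comparison (plain Python, names are mine), the loss is tiny:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def pairwise_loss(r_winner, r_loser):
    """Reward-model loss for one comparison: -log sigma(r_w - r_l)."""
    return -math.log(sigmoid(r_winner - r_loser))

pairwise_loss(3.0, 0.0)  # winner scored well above loser: small loss (~0.05)
pairwise_loss(0.0, 0.0)  # model can't tell them apart: log 2 (~0.69)
pairwise_loss(0.0, 3.0)  # model ranks them backwards: large loss (~3.05)
```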

Do this across tens of thousands of comparisons and the outputs get calibrated into a consistent scale.

A useful analogy: suppose you want to estimate everyone's height but you can only ask pairwise questions. "Is person A taller than B?" You never get a direct measurement. But after enough questions you can fit a model that assigns heights consistent with all your answers. The heights that come out are the reward scores. Your binary clicks were the training signal, not the numbers themselves.

Summary: The key idea

The human provides a binary label. The reward model provides the numbers. The loss function is the bridge: it shapes the numbers so that higher always means "more preferred by humans."

The model learns to assign scores such that the difference between them predicts human preference. Not absolute quality. Relative quality. The Elo insight all the way down to the loss function.

TRY IT
RESPONSE A: 1400 vs RESPONSE B: 1000. P(A wins) = 91%, P(B wins) = 9%. Initial state.

Two responses going head to head. Press → to run a match. Small shifts when the outcome is expected, larger ones when the underdog wins. Press ← to step back.


Once you have the reward model, PPO fine-tunes the language model to maximize r_\theta. The full loop is:

  1. Sample a prompt
  2. Generate two responses
  3. Human picks the better one
  4. Update the reward model with the Bradley-Terry loss
  5. Use the reward model as signal for PPO to update LLM weights
  6. Repeat thousands of times
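Steps 1 through 4 can be caricatured in a few lines of Python. Everything here is a stand-in I invented: the "LLM" emits a scalar quality, the "human" deterministically prefers higher quality, the reward model is a single trainable weight, and the PPO step (5) is omitted entirely:

```python
import math
import random

random.seed(0)

def generate_response(prompt):
    """Stand-in for the LLM: a response is just a hidden quality score in [0, 1]."""
    return random.random()

# Toy reward model: score(response) = w * quality, one trainable parameter.
w, lr = 0.0, 0.1

for _ in range(500):
    prompt = "explain backprop"                                  # 1. sample a prompt
    a, b = generate_response(prompt), generate_response(prompt)  # 2. generate two responses
    y_w, y_l = (a, b) if a > b else (b, a)                       # 3. "human" picks the better one
    # 4. one Bradley-Terry gradient step: raise the winner's score,
    #    lower the loser's, scaled by how surprised the model was.
    p = 1.0 / (1.0 + math.exp(-(w * y_w - w * y_l)))
    w += lr * (1.0 - p) * (y_w - y_l)

# w ends up positive: the reward model now ranks responses the way the "human" does.
```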

Everything downstream (the Bradley-Terry loss, the reward model architecture, the PPO KL penalty) flows from that one early decision: ask for comparisons, not scores.

Tip: Why Elo specifically

You could have used something else. Thurstone scaling, rank aggregation algorithms, something bespoke. But Elo is simple, incrementally updatable, and validated across decades of competitive games. Once you've framed the problem as a tournament, it's the obvious tool.


There's a broader point here. When you're collecting human feedback for anything, the framing of the question determines the quality of the data. Absolute scales feel intuitive because we think we have some internal quality meter. We don't. What we actually have are comparison engines:

  • We notice differences between options
  • We rank things against each other
  • We pick the better of two

Psychologists have known this since at least Fechner in the 1860s. Weber's law, Thurstone's law of comparative judgment... the whole apparatus of psychophysics is built on the observation that perception is relative, not absolute.

RLHF didn't discover this. But applying it here, in this context and for this problem, turns a messy annotation problem into something tractable and scalable.

And that's why Elo, a system invented to rank chess players in the 1960s, sits quietly at the core of how the most powerful language models in the world are trained today.