RLHF (Reinforcement Learning from Human Feedback)

Reinforcement Learning from Human Feedback explained — SFT, a reward model from preference comparisons, and PPO with a KL penalty — plus reward hacking and how DPO simplifies the pipeline.

9 min read · Reviewed May 2026

1. Big Picture

RLHF — Reinforcement Learning from Human Feedback — is how a raw language model becomes a helpful, aligned assistant. Next-token pretraining optimizes for plausible text, not good text; RLHF adds a signal for what humans actually prefer. The classic recipe is three stages: (1) supervised fine-tuning (SFT) on demonstration data, (2) train a reward model from human preference comparisons, and (3) optimize the policy (the LLM) with reinforcement learning — usually PPO — to maximize that reward, while a KL penalty keeps it close to the SFT model.

It's the technique behind InstructGPT/ChatGPT-style alignment. The frame to hold: humans can't write the perfect answer for every prompt, but they can reliably say which of two answers is better — so we learn a reward model from comparisons and then optimize against it. Interviewers probe the three stages, why the KL penalty exists, reward hacking, and the PPO-vs-DPO tradeoff.

2. Intuition + Visual

You can't directly backprop "be more helpful." So you build a proxy: show humans pairs of model outputs for the same prompt, ask which is better, and fit a reward model to those preferences. Then treat the LLM as a policy and use RL to produce outputs the reward model scores highly — but tether it to the original SFT model with a KL penalty so it doesn't drift into degenerate, reward-hacking text.

```mermaid
flowchart TB
    P["Pretrained LLM"] --> S["1. SFT on demonstrations"]
    S --> RM["2. Reward model from human A/B preferences"]
    S --> POL["3. Policy = SFT model"]
    RM --> RL["PPO: maximize reward − β·KL(policy ‖ SFT)"]
    POL --> RL
    RL --> A["Aligned model"]
```

The KL term is the safety leash: high reward with low KL means "better, but still recognizably the same model."

3. The Math

Reward model. Given human preference data where response $y_w$ is preferred over $y_l$ for prompt $x$, fit a scalar reward $r_\phi$ with the Bradley-Terry loss:

$$\mathcal{L}_{RM} = -\,\mathbb{E}_{(x,y_w,y_l)}\Big[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\Big]$$

i.e. push the preferred response's reward above the rejected one's.
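
As a quick numeric check of the loss above, here is a minimal sketch in plain Python (no PyTorch) for a single comparison. The reward values are made up, and the helper names `preference_prob` and `bt_loss` are illustrative rather than taken from any library.

```python
import math


def preference_prob(r_w: float, r_l: float) -> float:
    """Bradley-Terry probability that the response scored r_w beats the one scored r_l."""
    return 1.0 / (1.0 + math.exp(-(r_w - r_l)))


def bt_loss(r_w: float, r_l: float) -> float:
    """Negative log-likelihood of one observed preference (r_w is the chosen response)."""
    return -math.log(preference_prob(r_w, r_l))


# Hypothetical reward-model scores for one (chosen, rejected) pair.
correct_order = bt_loss(2.1, 0.4)  # reward model agrees with the human label
wrong_order = bt_loss(0.4, 2.1)    # reward model disagrees with the human label
tie = bt_loss(1.0, 1.0)            # zero margin: loss is exactly log(2)

print(round(correct_order, 3), round(wrong_order, 3), round(tie, 3))
```

The loss depends only on the difference $r_\phi(x, y_w) - r_\phi(x, y_l)$, so adding a constant to every reward leaves it unchanged: the reward model learns a relative scale, not an absolute one.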

Policy optimization. Optimize the policy $\pi_\theta$ to maximize reward while staying near the reference (SFT) policy $\pi_{\text{ref}}$:

$$\max_{\theta}\; \mathbb{E}_{x \sim D,\, y \sim \pi_\theta}\Big[\, r_\phi(x,y)\, \Big] - \beta\, \mathbb{D}_{\mathrm{KL}}\big(\pi_\theta(\cdot\mid x)\,\|\,\pi_{\text{ref}}(\cdot\mid x)\big)$$
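
To see what this objective does, consider a toy case with one prompt and three candidate responses. In that finite setting the maximizer has a known closed form, $\pi^*(y) \propto \pi_{\text{ref}}(y)\,\exp(r_\phi(x,y)/\beta)$. The sketch below evaluates it for several values of $\beta$; the probabilities, rewards, and function names are assumptions chosen for illustration.

```python
import math


def objective(pi, pi_ref, rewards, beta):
    """Expected reward minus beta * KL(pi || pi_ref) for a single prompt."""
    expected_reward = sum(p * r for p, r in zip(pi, rewards))
    kl = sum(p * math.log(p / q) for p, q in zip(pi, pi_ref) if p > 0)
    return expected_reward - beta * kl


def optimal_policy(pi_ref, rewards, beta):
    """Closed-form maximizer: pi*(y) proportional to pi_ref(y) * exp(r(y) / beta)."""
    weights = [q * math.exp(r / beta) for q, r in zip(pi_ref, rewards)]
    z = sum(weights)
    return [w / z for w in weights]


# Toy setup (assumed values): three candidate responses to one prompt.
pi_ref = [0.5, 0.3, 0.2]   # SFT policy probabilities
rewards = [0.0, 1.0, 2.0]  # reward-model scores

for beta in (0.1, 1.0, 10.0):
    pi_star = optimal_policy(pi_ref, rewards, beta)
    value = objective(pi_star, pi_ref, rewards, beta)
    print(beta, [round(p, 3) for p in pi_star], round(value, 3))
```

With a small $\beta$ nearly all probability moves to the highest-reward response; with a large $\beta$ the optimum stays close to `pi_ref`.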

PPO implements this with a clipped surrogate objective that limits how far the policy moves per update (stability). $\beta$ controls the leash: too small and the model reward-hacks; too large and it barely changes. DPO (Direct Preference Optimization) is the popular alternative: it derives a closed-form loss directly on the preference data that implicitly optimizes the same KL-regularized objective, skipping the separate reward model and RL loop entirely (it still needs the frozen reference policy).
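
The two losses named above can be sketched for a single sample in plain Python. This is a minimal illustration, not a training loop: the function names are hypothetical, `eps=0.2` and `beta=0.1` are assumed defaults, and the log-probabilities are made-up numbers.

```python
import math


def ppo_clipped_term(logp_new, logp_old, advantage, eps=0.2):
    """PPO clipped surrogate for one action (a quantity to maximize)."""
    ratio = math.exp(logp_new - logp_old)             # pi_new(a) / pi_old(a)
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))   # ratio clipped to [1 - eps, 1 + eps]
    return min(ratio * advantage, clipped * advantage)


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair, from sequence log-probabilities."""
    # Implicit reward margin: beta * (log-ratio of chosen minus log-ratio of rejected).
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))


# Ratio 1.5 with a positive advantage: the gain is capped at (1 + eps) * advantage.
print(ppo_clipped_term(math.log(1.5), 0.0, advantage=1.0))

# Policy identical to the reference: the margin is 0, so the loss is log(2).
print(dpo_loss(-5.0, -6.0, -5.0, -6.0))
```

In the PPO term, the clip removes any incentive to push the probability ratio past $1 \pm \epsilon$ in the direction the advantage favors. In the DPO term, the loss falls only when the policy raises the chosen response's log-probability relative to the reference by more than it raises the rejected one's.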

4. Implementation

The reward-model loss and the KL-shaped reward — the two pieces unique to RLHF:

```python
import torch
import torch.nn.functional as F
from torch import Tensor


def reward_model_loss(r_chosen: Tensor, r_rejected: Tensor) -> Tensor:
    # Bradley-Terry: preferred reward should exceed rejected reward
    return -F.logsigmoid(r_chosen - r_rejected).mean()


def kl_shaped_reward(
    reward: Tensor, logp_policy: Tensor, logp_ref: Tensor, beta: float = 0.1
) -> Tensor:
    # per-token reward used by PPO: task reward minus KL drift from the SFT model
    # (shapes must broadcast, e.g. reward (batch, 1) against log-probs (batch, seq_len))
    kl = logp_policy - logp_ref  # log-ratio on sampled tokens: sample estimate of KL
    return reward - beta * kl


# Sanity check: correctly ordered rewards give a lower loss than swapped ones
r_chosen, r_rejected = torch.tensor([2.1]), torch.tensor([0.4])
assert reward_model_loss(r_chosen, r_rejected) < reward_model_loss(r_rejected, r_chosen)
```
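
The reward model scores a whole response, while the KL term is computed per token. One common arrangement, assumed here rather than taken from the listing above, is to charge the KL penalty on every token and add the scalar reward on the final token only. A dependency-free sketch with a hypothetical helper `per_token_rewards` and made-up log-probabilities:

```python
def per_token_rewards(seq_reward, logp_policy, logp_ref, beta=0.1):
    """KL penalty on every token; the sequence-level reward is added on the last token.

    Assumes a non-empty response with one log-prob per generated token.
    """
    rewards = [-beta * (lp - lr) for lp, lr in zip(logp_policy, logp_ref)]
    rewards[-1] += seq_reward
    return rewards


# Assumed log-probs of a 4-token response under the policy and the frozen SFT reference.
logp_policy = [-1.0, -0.5, -2.0, -0.1]
logp_ref = [-1.2, -0.5, -1.0, -0.4]

rewards = per_token_rewards(1.5, logp_policy, logp_ref, beta=0.1)
print([round(r, 3) for r in rewards])
```

Individual per-token terms can be positive (the third token here, where the policy is less confident than the reference); only the expected log-ratio under the policy is guaranteed to be non-negative.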

5. Interview Questions
  1. Conceptual: What are the three stages of RLHF? (SFT on demonstrations → reward model from human preference comparisons → RL (PPO) optimizing the policy against the reward model.)
  2. Implementation: Why train a reward model from comparisons instead of absolute scores? (Humans are far more consistent ranking A vs B than assigning calibrated numeric scores; Bradley-Terry turns comparisons into a trainable reward.)
  3. Applied: What is the KL penalty for? (It keeps the optimized policy close to the SFT model, preventing drift into degenerate, reward-hacking outputs.)
  4. Systems-level: What is reward hacking and how does the setup mitigate it? (The policy exploits flaws in the imperfect reward model for high score / low quality; the KL leash and capping training steps limit it.)
  5. Failure modes: How does DPO differ from PPO-based RLHF? (DPO optimizes a closed-form preference loss directly on the data — no separate reward model or RL loop — implicitly solving the same KL-regularized objective.)
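
To make the reward-hacking answer in question 4 concrete, here is a toy sketch under stated assumptions: three candidate responses, a flawed proxy reward that overrates one of them, and the closed-form optimum of the KL-regularized objective, $\pi^*(y) \propto \pi_{\text{ref}}(y)\exp(r(y)/\beta)$. All numbers and names are invented for illustration.

```python
import math


def kl_regularized_optimum(pi_ref, proxy_reward, beta):
    """Policy maximizing E[proxy] - beta * KL(pi || pi_ref) over a finite response set."""
    weights = [q * math.exp(r / beta) for q, r in zip(pi_ref, proxy_reward)]
    z = sum(weights)
    return [w / z for w in weights]


def expected(pi, values):
    """Expectation of `values` under the distribution `pi`."""
    return sum(p * v for p, v in zip(pi, values))


# Assumed toy numbers for three responses: [good answer, weak answer, flattering filler].
pi_ref = [0.30, 0.65, 0.05]      # SFT policy
proxy_reward = [1.0, 0.0, 1.5]   # flawed reward model overrates the filler
true_quality = [1.0, 0.0, -1.0]  # what humans would actually judge

print(f"SFT: proxy={expected(pi_ref, proxy_reward):.2f}, true={expected(pi_ref, true_quality):.2f}")
for beta in (1.0, 0.1):
    pi = kl_regularized_optimum(pi_ref, proxy_reward, beta)
    print(f"beta={beta}: proxy={expected(pi, proxy_reward):.2f}, true={expected(pi, true_quality):.2f}")
```

In this toy, a moderate leash improves true quality over the SFT baseline, while a very loose one keeps raising the proxy score as true quality collapses, which is the over-optimization pattern the KL penalty is meant to limit.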

6. Retrieval Check

From memory: name the three RLHF stages, write the Bradley-Terry reward loss, write the KL-penalized objective, and say what β controls. Then state one way DPO simplifies the pipeline. Check against Sections 1–3.
