What are the three stages of RLHF?

Supervised fine-tuning (SFT) on demonstrations, training a reward model from human preference comparisons, then optimizing the policy with reinforcement learning (PPO) against that reward model.

Why train a reward model from comparisons instead of absolute scores?

Humans are far more consistent ranking response A vs. B than assigning calibrated numeric scores; the Bradley-Terry model turns those comparisons into a trainable scalar reward.

What is the KL penalty in RLHF for?

It keeps the optimized policy close to the SFT reference model, preventing it from drifting into degenerate, reward-hacking outputs that exploit the imperfect reward model.

What is the difference between PPO and DPO?

PPO optimizes the policy with reinforcement learning against a separate reward model. DPO (Direct Preference Optimization) derives a closed-form loss directly on the preference data, implicitly solving the same KL-regularized objective without a reward model or RL loop.

RLHF Explained (ML Interview)

Reinforcement Learning from Human Feedback explained — SFT, a reward model from preference comparisons, and PPO with a KL penalty — plus reward hacking and how DPO simplifies the pipeline.

1. Big Picture

RLHF — Reinforcement Learning from Human Feedback — is how a raw language model becomes a helpful, aligned assistant. Next-token pretraining optimizes for plausible text, not good text; RLHF adds a signal for what humans actually prefer. The classic recipe is three stages: (1) supervised fine-tuning (SFT) on demonstration data, (2) train a reward model from human preference comparisons, and (3) optimize the policy (the LLM) with reinforcement learning — usually PPO — to maximize that reward, while a KL penalty keeps it close to the SFT model.

It's the technique behind InstructGPT/ChatGPT-style alignment. The frame to hold: humans can't write the perfect answer for every prompt, but they can reliably say which of two answers is better — so we learn a reward model from comparisons and then optimize against it. Interviewers probe the three stages, why the KL penalty exists, reward hacking, and the PPO-vs-DPO tradeoff.

2. Intuition + Visual

You can't directly backprop "be more helpful." So you build a proxy: show humans pairs of model outputs for the same prompt, ask which is better, and fit a reward model to those preferences. Then treat the LLM as a policy and use RL to produce outputs the reward model scores highly — but tether it to the original SFT model with a KL penalty so it doesn't drift into degenerate, reward-hacking text.

The KL term is the safety leash: high reward with low KL means "better, but still recognizably the same model."
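To make the leash concrete, here is a toy sketch over three candidate responses to a single prompt. It relies on the standard closed-form maximizer of "expected reward minus a weighted KL penalty" on a finite set, where the optimal policy is proportional to the reference probability times `exp(reward / beta)`. The probabilities, rewards, and the helper names `kl_regularized_optimum` and `kl_divergence` are made up for illustration, and `beta` stands for the weight on the KL penalty.

```python
import torch


def kl_regularized_optimum(ref_probs: torch.Tensor, rewards: torch.Tensor, beta: float) -> torch.Tensor:
    # Closed-form maximizer of E[r] - beta * KL(pi || ref) over a finite set:
    # pi*(y) is proportional to ref(y) * exp(r(y) / beta).
    logits = torch.log(ref_probs) + rewards / beta
    return torch.softmax(logits, dim=-1)


def kl_divergence(p: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
    # KL(p || q) for discrete distributions with full support.
    return (p * (torch.log(p) - torch.log(q))).sum()


# Hypothetical numbers: three candidate responses to one prompt.
ref_probs = torch.tensor([0.6, 0.3, 0.1])  # reference (SFT) policy
rewards = torch.tensor([0.0, 1.0, 3.0])    # reward-model scores

for beta in (0.1, 1.0, 10.0):
    pi = kl_regularized_optimum(ref_probs, rewards, beta)
    expected_reward = (pi * rewards).sum().item()
    kl = kl_divergence(pi, ref_probs).item()
    print(f"beta={beta:>4}: E[reward]={expected_reward:.3f}, KL={kl:.3f}")
```

A short leash (large `beta`) leaves the policy almost identical to the reference, while a long leash (small `beta`) piles nearly all probability onto the highest-reward response at the cost of a large KL.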

3. The Math

Reward model. Given human preference data where response $y_w$ is preferred over $y_l$ for prompt $x$, fit a scalar reward $r_\phi$ with the Bradley-Terry loss:

$$\mathcal{L}_{RM} = -\,\mathbb{E}_{(x,y_w,y_l)}\Big[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\Big]$$

i.e. push the preferred response's reward above the rejected one's.
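As a quick worked example, the Bradley-Terry model reads the reward gap as a preference probability, $\sigma(r_w - r_l)$, and the loss is the negative log of that probability. The sketch below uses plain Python with made-up reward values; `preference_probability` and `pair_loss` are illustrative names, not part of any library.

```python
import math


def preference_probability(r_w: float, r_l: float) -> float:
    # Bradley-Terry: P(y_w preferred over y_l) = sigmoid(r_w - r_l)
    return 1.0 / (1.0 + math.exp(-(r_w - r_l)))


def pair_loss(r_w: float, r_l: float) -> float:
    # Negative log-likelihood of the observed human preference
    return -math.log(preference_probability(r_w, r_l))


# Hypothetical rewards for one comparison
print(preference_probability(2.1, 0.4))  # reward model agrees with the label
print(pair_loss(2.1, 0.4))               # small loss
print(pair_loss(0.4, 2.1))               # large loss: ranking is reversed
print(pair_loss(1.0, 1.0))               # tie: loss equals log(2)
```

Only the difference between the two rewards enters the loss, so adding a constant to every reward leaves it unchanged. The reward scale is therefore only meaningful relative to other responses.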

Policy optimization. Optimize the policy $\pi_\theta$ to maximize reward while staying near the reference (SFT) policy $\pi_{\text{ref}}$:

$$\max_{\theta}\; \mathbb{E}_{x \sim D}\Big[\, \mathbb{E}_{y \sim \pi_\theta(\cdot\mid x)}\big[\, r_\phi(x,y)\, \big] - \beta\, \mathbb{D}_{\mathrm{KL}}\big(\pi_\theta(\cdot\mid x)\,\|\,\pi_{\text{ref}}(\cdot\mid x)\big) \Big]$$

PPO implements this with a clipped surrogate objective that limits how far the policy moves per update (stability). $\beta$ controls the leash: too small and the model reward-hacks; too large and it barely changes. DPO (Direct Preference Optimization) is the popular alternative — it derives a closed-form loss directly on the preference data that implicitly optimizes the same KL-regularized objective, skipping the separate reward model and RL loop entirely.
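The clipped surrogate mentioned above can be sketched in a few lines. This is a minimal illustration, not a full PPO trainer: it assumes the log-probabilities and advantage estimates have already been computed elsewhere, and the function name `ppo_clipped_loss` and the default `clip_eps=0.2` are assumptions chosen for the example.

```python
import torch
from torch import Tensor


def ppo_clipped_loss(
    logp_new: Tensor, logp_old: Tensor, advantages: Tensor, clip_eps: float = 0.2
) -> Tensor:
    # Probability ratio between the updated policy and the policy that sampled the data
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Take the pessimistic (smaller) objective, negated so it can be minimized
    return -torch.min(unclipped, clipped).mean()


# Hypothetical values: one action with positive advantage
logp_old = torch.tensor([-1.0])
advantages = torch.tensor([1.0])
moderate = ppo_clipped_loss(torch.tensor([-0.5]), logp_old, advantages)
extreme = ppo_clipped_loss(torch.tensor([0.0]), logp_old, advantages)
print(moderate.item(), extreme.item())
```

Both calls give the same loss because the ratio is already past `1 + clip_eps` in each case: once the policy has moved that far on a sample, pushing further earns no extra objective, which is what limits the size of each update.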

4. Implementation

The reward-model loss and the KL-shaped reward — the two pieces unique to RLHF:

```python
import torch
import torch.nn.functional as F
from torch import Tensor


def reward_model_loss(r_chosen: Tensor, r_rejected: Tensor) -> Tensor:
    # Bradley-Terry: preferred reward should exceed rejected reward
    return -F.logsigmoid(r_chosen - r_rejected).mean()


def kl_shaped_reward(
    reward: Tensor, logp_policy: Tensor, logp_ref: Tensor, beta: float = 0.1
) -> Tensor:
    # per-token reward used by PPO: task reward minus KL drift from the SFT model
    kl = logp_policy - logp_ref               # sample estimate of KL
    return reward - beta * kl


r_chosen, r_rejected = torch.tensor([2.1]), torch.tensor([0.4])
assert reward_model_loss(r_chosen, r_rejected) < reward_model_loss(r_rejected, r_chosen)
```
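For comparison, the DPO alternative from the math section can be sketched in the same style. This is a minimal version of the standard DPO loss, assuming each input is the summed log-probability of a full response under the policy or the frozen reference model. The function name `dpo_loss`, the argument names, and the default `beta=0.1` are illustrative choices, not a specific library's API.

```python
import torch
import torch.nn.functional as F
from torch import Tensor


def dpo_loss(
    policy_logp_chosen: Tensor,
    policy_logp_rejected: Tensor,
    ref_logp_chosen: Tensor,
    ref_logp_rejected: Tensor,
    beta: float = 0.1,
) -> Tensor:
    # Implicit reward of a response: beta * log(pi_theta(y|x) / pi_ref(y|x))
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    # Bradley-Terry loss applied to the implicit rewards
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()


# Hypothetical sequence log-probs for one preference pair
ref_chosen, ref_rejected = torch.tensor([-12.0]), torch.tensor([-10.0])
at_init = dpo_loss(ref_chosen, ref_rejected, ref_chosen, ref_rejected)
improved = dpo_loss(ref_chosen + 1.0, ref_rejected - 1.0, ref_chosen, ref_rejected)
print(at_init.item(), improved.item())
```

When the policy equals the reference, both log-ratios are zero and the loss is log(2). Raising the chosen response's likelihood relative to the reference, or lowering the rejected one's, reduces it, with no sampled rollouts or separate reward network involved.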

5. Interview Questions

Conceptual: What are the three stages of RLHF? (SFT on demonstrations → reward model from human preference comparisons → RL (PPO) optimizing the policy against the reward model.)
Implementation: Why train a reward model from comparisons instead of absolute scores? (Humans are far more consistent ranking A vs B than assigning calibrated numeric scores; Bradley-Terry turns comparisons into a trainable reward.)
Applied: What is the KL penalty for? (It keeps the optimized policy close to the SFT model, preventing drift into degenerate, reward-hacking outputs.)
Systems-level: What is reward hacking and how does the setup mitigate it? (The policy exploits flaws in the imperfect reward model to earn high scores with low-quality outputs; the KL leash and a cap on training steps limit it.)
Failure modes: How does DPO differ from PPO-based RLHF? (DPO optimizes a closed-form preference loss directly on the data — no separate reward model or RL loop — implicitly solving the same KL-regularized objective.)

6. Retrieval Check

From memory: name the three RLHF stages, write the Bradley-Terry reward loss, write the KL-penalized objective and say what β controls. Then state one way DPO simplifies the pipeline. Check against Sections 1–3.
