RLHF (Reinforcement Learning from Human Feedback)
Reinforcement Learning from Human Feedback explained — SFT, a reward model from preference comparisons, and PPO with a KL penalty — plus reward hacking and how DPO simplifies the pipeline.
RLHF — Reinforcement Learning from Human Feedback — is how a raw language model becomes a helpful, aligned assistant. Next-token pretraining optimizes for plausible text, not good text; RLHF adds a signal for what humans actually prefer. The classic recipe is three stages: (1) supervised fine-tuning (SFT) on demonstration data, (2) train a reward model from human preference comparisons, and (3) optimize the policy (the LLM) with reinforcement learning — usually PPO — to maximize that reward, while a KL penalty keeps it close to the SFT model.
It's the technique behind InstructGPT/ChatGPT-style alignment. The frame to hold: humans can't write the perfect answer for every prompt, but they can reliably say which of two answers is better — so we learn a reward model from comparisons and then optimize against it. Interviewers probe the three stages, why the KL penalty exists, reward hacking, and the PPO-vs-DPO tradeoff.
You can't directly backprop "be more helpful." So you build a proxy: show humans pairs of model outputs for the same prompt, ask which is better, and fit a reward model to those preferences. Then treat the LLM as a policy and use RL to produce outputs the reward model scores highly — but tether it to the original SFT model with a KL penalty so it doesn't drift into degenerate, reward-hacking text.
flowchart TB
P["Pretrained LLM"] --> S["1. SFT on demonstrations"]
S --> RM["2. Reward model from human A/B preferences"]
S --> POL["3. Policy = SFT model"]
RM --> RL["PPO: maximize reward − β·KL(policy ‖ SFT)"]
POL --> RL
RL --> A["Aligned model"]
The KL term is the safety leash: high reward with low KL means "better, but still recognizably the same model."
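To make the leash concrete: over a small discrete set of responses, the objective "expected reward minus β times KL to the SFT model" has a known closed-form maximizer, π*(y) ∝ π_SFT(y)·exp(r(y)/β). The sketch below applies it to made-up numbers to show how β trades reward against drift. The four probabilities, the four rewards, and the helper names are hypothetical, and this is a toy illustration rather than what PPO computes in practice.

```python
import torch


def kl_regularized_optimum(
    reward: torch.Tensor, ref_probs: torch.Tensor, beta: float
) -> torch.Tensor:
    """Maximizer of E_pi[reward] - beta * KL(pi || ref) over a finite response set."""
    # pi*(y) is proportional to ref(y) * exp(reward(y) / beta)
    logits = ref_probs.log() + reward / beta
    return torch.softmax(logits, dim=-1)


def kl_divergence(p: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
    """Exact KL(p || q) for two discrete distributions with full support."""
    return (p * (p.log() - q.log())).sum()


# Hypothetical toy setup: four candidate responses to a single prompt.
ref_probs = torch.tensor([0.40, 0.30, 0.25, 0.05])  # SFT model's probabilities
reward = torch.tensor([0.0, 1.0, 2.0, 3.0])          # reward-model scores

for beta in (10.0, 1.0, 0.1):
    pi = kl_regularized_optimum(reward, ref_probs, beta)
    expected_reward = (pi * reward).sum().item()
    kl = kl_divergence(pi, ref_probs).item()
    print(f"beta={beta:>5}: E[reward]={expected_reward:.3f}  KL={kl:.3f}")
```

Running it should show expected reward and KL rising together as β shrinks: a loose leash buys reward by moving probability mass onto responses the SFT model considered unlikely.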
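Reward model. Given human preference data where response $y_w$ is preferred over $y_l$ for prompt $x$, fit a scalar reward $r_\phi(x, y)$ with the Bradley-Terry loss:
$$\mathcal{L}_{\text{RM}}(\phi) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\Big[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\Big]$$
i.e. push the preferred response's reward above the rejected one's.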
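Where the loss comes from, as a short worked derivation. The notation ($y_w$ for the preferred response, $y_l$ for the rejected one, $r_\phi$ for the reward model) is assumed here for illustration. The Bradley-Terry model says the probability that a human prefers $y_w$ depends only on the reward gap:

$$P(y_w \succ y_l \mid x) = \frac{\exp r_\phi(x, y_w)}{\exp r_\phi(x, y_w) + \exp r_\phi(x, y_l)} = \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)$$

Taking the negative log-likelihood of the observed preferences gives the reward-model loss. For a single pair with rewards $2.1$ and $0.4$:

$$\sigma(2.1 - 0.4) = \sigma(1.7) \approx 0.845, \qquad -\log \sigma(1.7) \approx 0.168$$

Swapping the two rewards gives $-\log \sigma(-1.7) \approx 1.868$, so the loss is small when the ranking is right and large when it is wrong. Only the difference matters: adding a constant to every reward leaves the loss unchanged.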
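Policy optimization. Optimize the policy $\pi_\theta$ to maximize the reward $r_\phi$ while staying near the reference (SFT) policy $\pi_{\text{ref}}$:
$$\max_{\pi_\theta}\; \mathbb{E}_{x \sim \mathcal{D}}\Big[\mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[r_\phi(x, y)\big] - \beta\,\mathrm{KL}\big(\pi_\theta(\cdot \mid x)\,\|\,\pi_{\text{ref}}(\cdot \mid x)\big)\Big]$$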
PPO implements this with a clipped surrogate objective that limits how far the policy moves per update, which keeps training stable. $\beta$ controls the leash: too small and the model reward-hacks; too large and it barely changes. DPO (Direct Preference Optimization) is the popular alternative: it uses the closed-form solution of the same KL-regularized objective to derive a simple classification-style loss directly on the preference data, skipping the separate reward model and RL loop entirely.
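The clipped surrogate mentioned above can be sketched in a few lines. This is a minimal illustration, not a full PPO trainer: it assumes per-token advantages are already available (in RLHF they would typically be estimated from the KL-shaped rewards with a value function), and the function name, `clip_eps` default, and sample numbers are illustrative choices.

```python
import torch
from torch import Tensor


def ppo_clipped_loss(
    logp_new: Tensor, logp_old: Tensor, advantages: Tensor, clip_eps: float = 0.2
) -> Tensor:
    """Clipped surrogate loss (to minimize) over a batch of sampled tokens."""
    # Probability ratio between the updated policy and the policy that sampled the data.
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Keep the more pessimistic objective per token, then negate for gradient descent.
    return -torch.min(unclipped, clipped).mean()


# Hypothetical numbers: log-probs of three sampled tokens and their advantages.
logp_old = torch.tensor([-1.0, -2.0, -0.5])
logp_new = torch.tensor([-0.2, -2.1, -0.5], requires_grad=True)
advantages = torch.tensor([1.0, -0.5, 0.3])

loss = ppo_clipped_loss(logp_new, logp_old, advantages)
loss.backward()
print(loss.item(), logp_new.grad)
```

The first token already has a ratio above `1 + clip_eps` with a positive advantage, so the clipped branch is selected and its gradient is zero: the update gets no further credit for pushing that token's probability even higher. That is the per-update limit on policy movement.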
The reward-model loss and the KL-shaped reward — the two pieces unique to RLHF:
import torch
import torch.nn.functional as F
from torch import Tensor


def reward_model_loss(r_chosen: Tensor, r_rejected: Tensor) -> Tensor:
    # Bradley-Terry: preferred reward should exceed rejected reward
    return -F.logsigmoid(r_chosen - r_rejected).mean()


def kl_shaped_reward(
    reward: Tensor, logp_policy: Tensor, logp_ref: Tensor, beta: float = 0.1
) -> Tensor:
    # per-token reward used by PPO: task reward minus KL drift from the SFT model
    kl = logp_policy - logp_ref  # sample estimate of KL
    return reward - beta * kl


r_chosen, r_rejected = torch.tensor([2.1]), torch.tensor([0.4])
assert reward_model_loss(r_chosen, r_rejected) < reward_model_loss(r_rejected, r_chosen)

- Conceptual: What are the three stages of RLHF? (SFT on demonstrations → reward model from human preference comparisons → RL (PPO) optimizing the policy against the reward model.)
- Implementation: Why train a reward model from comparisons instead of absolute scores? (Humans are far more consistent ranking A vs B than assigning calibrated numeric scores; Bradley-Terry turns comparisons into a trainable reward.)
- Applied: What is the KL penalty for? (It keeps the optimized policy close to the SFT model, preventing drift into degenerate, reward-hacking outputs.)
- Systems-level: What is reward hacking and how does the setup mitigate it? (The policy exploits flaws in the imperfect reward model to earn high scores with low-quality outputs; the KL leash and a cap on training steps limit it.)
- Failure modes: How does DPO differ from PPO-based RLHF? (DPO optimizes a closed-form preference loss directly on the data — no separate reward model or RL loop — implicitly solving the same KL-regularized objective.)
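The DPO answer in the last question can be made concrete with a short sketch. It assumes you already have summed sequence log-probabilities for the chosen and rejected responses under both the trainable policy and the frozen reference (SFT) model; the function name, argument names, and sample values are illustrative.

```python
import torch
import torch.nn.functional as F
from torch import Tensor


def dpo_loss(
    policy_logp_chosen: Tensor,
    policy_logp_rejected: Tensor,
    ref_logp_chosen: Tensor,
    ref_logp_rejected: Tensor,
    beta: float = 0.1,
) -> Tensor:
    """DPO loss from sequence log-probs under the policy and the frozen reference."""
    # Implicit rewards: beta * log(pi(y|x) / pi_ref(y|x)) for each response.
    chosen_reward = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (policy_logp_rejected - ref_logp_rejected)
    # Same Bradley-Terry form as the reward-model loss, applied to implicit rewards.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()


# Hypothetical log-probs for two preference pairs under the reference model.
ref_chosen = torch.tensor([-12.0, -8.0])
ref_rejected = torch.tensor([-11.0, -9.5])

# At initialization the policy equals the reference, so every implicit reward is 0.
loss_at_init = dpo_loss(ref_chosen, ref_rejected, ref_chosen, ref_rejected)

# A policy that raises chosen and lowers rejected log-probs gets a lower loss.
loss_improved = dpo_loss(ref_chosen + 1.0, ref_rejected - 1.0, ref_chosen, ref_rejected)
print(loss_at_init.item(), loss_improved.item())
```

No reward model appears anywhere: the log-ratio against the reference plays the role of the reward, and β is the same leash parameter as in the KL-regularized objective. At initialization the loss equals log 2, since the policy has not yet expressed any preference.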
From memory: name the three RLHF stages, write the Bradley-Terry reward loss, write the KL-penalized objective and say what β controls. Then state one way DPO simplifies the pipeline. Check against Stages 1–3.