LLMs
Temperature, Top-k & Top-p Sampling
How LLMs choose the next token — temperature, top-k, top-p (nucleus), and beam search — the math, the tradeoffs, and a runnable PyTorch sampler.
At each step, a language model outputs a probability distribution over the whole vocabulary; a decoding strategy decides which token to actually emit. The choices trade off coherence against diversity. Greedy always takes the argmax — safe but repetitive. Temperature reshapes the distribution before sampling. Top-k and top-p (nucleus) truncate the distribution to its most plausible tokens before sampling. Beam search keeps several high-probability sequences alive.
Getting these right is the difference between bland, looping text and fluent, varied generation. The frame to hold: the model gives you a distribution; decoding controls how much you trust the peak vs. explore the tail. Interviewers expect you to explain what temperature does to the logits, the difference between top-k and top-p, and when greedy/beam beats sampling.
Temperature scales the logits before softmax. Low temperature ($T < 1$) sharpens the distribution toward the most likely token (approaching greedy); high temperature ($T > 1$) flattens it, giving rarer tokens a real chance: more creative, but more error-prone. Top-k and top-p then truncate: top-k keeps the k highest-probability tokens; top-p keeps the smallest set whose probabilities sum to at least p, adapting how many candidates survive to how confident the model is.
flowchart LR
L["Logits"] --> T["divide by temperature T"]
T --> S["softmax -> probabilities"]
S --> F["truncate: top-k or top-p"]
F --> P["renormalize and sample"]
Given logits $z$, temperature $T$ rescales them before softmax:
$$p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$
$T = 1$ is the raw distribution; $T \to 0$ concentrates all mass on the argmax (greedy); $T \to \infty$ flattens toward uniform.
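To see the sharpening and flattening numerically, here is a minimal plain-Python sketch of a temperature-scaled softmax. The function name `softmax_with_temperature` and the three logits are made up for illustration; they do not come from any library.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Return softmax(logits / temperature) as a list of probabilities."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits for a three-token vocabulary (illustrative only).
logits = [2.0, 1.0, 0.0]

for t in (0.1, 1.0, 10.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: {[round(p, 3) for p in probs]}")
```

At the lowest temperature nearly all of the mass sits on the first token, at a temperature of 1 it holds roughly two thirds, and at the highest temperature the three probabilities are close to uniform.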
Top-k keeps the set $V_k$ of the $k$ highest-probability tokens, zeros the rest, and renormalizes. Top-p (nucleus) keeps the smallest set $V_p$ of tokens, taken in descending order of probability $p_i$, such that
$$\sum_{i \in V_p} p_i \ge p$$
then renormalizes and samples from it. The key difference: top-k uses a fixed candidate count regardless of confidence; top-p uses a dynamic count, with few candidates when the model is sure and many when it's uncertain. Beam search isn't sampling at all: it maintains the $B$ highest-probability partial sequences at each step, which suits low-entropy tasks like translation but is prone to dull, repetitive text in open-ended generation.
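The fixed-versus-dynamic distinction is easiest to see on two hand-written distributions. The sketch below is plain Python; the helper names `top_k_set`, `top_p_set`, and `renormalize` and both probability vectors are assumptions chosen for illustration.

```python
def top_k_set(probs, k):
    """Indices of the k highest-probability tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return order[:k]

def top_p_set(probs, p):
    """Smallest prefix of the probability-sorted tokens whose mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:  # stop as soon as the kept set covers at least p
            break
    return kept

def renormalize(probs, kept):
    """Rescale the surviving probabilities so they sum to 1."""
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

# Made-up next-token distributions (illustrative only).
confident = [0.90, 0.05, 0.03, 0.01, 0.01]
uncertain = [0.25, 0.22, 0.20, 0.18, 0.15]

for name, probs in (("confident", confident), ("uncertain", uncertain)):
    k_kept = top_k_set(probs, 3)
    p_kept = top_p_set(probs, 0.8)
    print(name, "top-k keeps", len(k_kept), "| top-p keeps", len(p_kept))
```

With these numbers, top-k with k = 3 keeps three tokens in both cases, while top-p with p = 0.8 keeps one token for the confident distribution and four for the uncertain one.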
import torch
import torch.nn.functional as F

def sample(logits: torch.Tensor, temperature=1.0, top_k=0, top_p=0.0) -> int:
    """Sample one token id from 1-D logits; top_k=0 and top_p=0.0 disable truncation."""
    logits = logits / max(temperature, 1e-8)            # temperature scaling
    if top_k > 0:                                       # keep the k highest logits
        k = min(top_k, logits.size(-1))                 # guard against k > vocab size
        kth = torch.topk(logits, k).values[..., -1, None]
        logits = logits.masked_fill(logits < kth, float("-inf"))
    probs = F.softmax(logits, dim=-1)
    if top_p > 0.0:                                     # nucleus: smallest set with mass >= p
        sorted_probs, idx = torch.sort(probs, descending=True)
        mass_before = torch.cumsum(sorted_probs, dim=-1) - sorted_probs
        cutoff = mass_before >= top_p                   # drop tokens once the kept set reaches p
        cutoff[..., 0] = False                          # always keep the top token
        sorted_probs[cutoff] = 0.0
        probs = torch.zeros_like(probs).scatter(-1, idx, sorted_probs)
    probs = probs / probs.sum(dim=-1, keepdim=True)     # renormalize after truncation
    return torch.multinomial(probs, 1).item()

logits = torch.randn(50000)
token = sample(logits, temperature=0.8, top_p=0.9)

- Conceptual: What does temperature do to the next-token distribution? (Scales logits before softmax: low T sharpens toward greedy, high T flattens toward uniform and more random output.)
- Implementation: What's the difference between top-k and top-p sampling? (Top-k keeps a fixed number of candidates; top-p keeps a dynamic number whose cumulative probability reaches p — adapting to the model's confidence.)
- Applied: When would you prefer greedy or beam search over sampling? (Low-entropy, single-correct-answer tasks like translation or summarization, where you want the most probable sequence rather than diversity.)
- Systems-level: What does temperature = 0 give you, and why is it useful? (Deterministic greedy decoding — reproducible outputs, useful for evals, tests, and tasks needing consistency.)
- Failure modes: Why can beam search produce dull or repetitive text in open-ended generation? (It optimizes for high total probability, which favors safe, generic continuations and degenerate repetition over natural diversity.)
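To make "optimizes for high total probability" concrete, here is a minimal beam-search sketch over a hypothetical toy model. `TOY_MODEL` is an invented lookup table in which the next-token probabilities depend only on the last token; it stands in for a real language model and is not any library's API.

```python
import math

# Hypothetical toy "model": next-token probabilities given only the last token.
TOY_MODEL = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.5, "y": 0.5},
    "b": {"x": 0.9, "y": 0.1},
    "x": {"</s>": 1.0},
    "y": {"</s>": 1.0},
}

def beam_search(beam_width, max_steps=3):
    """Return the final beams as (log_prob, tokens), best first."""
    beams = [(0.0, ["<s>"])]
    for _ in range(max_steps):
        candidates = []
        for score, seq in beams:
            last = seq[-1]
            if last == "</s>":  # finished sequences carry over unchanged
                candidates.append((score, seq))
                continue
            for token, prob in TOY_MODEL[last].items():
                # Sequence score is the sum of token log-probabilities.
                candidates.append((score + math.log(prob), seq + [token]))
        # Keep only the beam_width highest-scoring partial sequences.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    return beams

for width in (1, 2):
    score, seq = beam_search(width)[0]
    print(f"width={width}: {' '.join(seq)}  p={math.exp(score):.2f}")
```

With a beam width of 1 this reduces to greedy decoding and ends at `<s> a x </s>` with probability 0.30. With a width of 2 it finds `<s> b x </s>` at 0.36, because the locally weaker first token leads to a more probable sequence overall. The same objective is what pulls open-ended generation toward generic, high-probability continuations.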
From memory: write the temperature-scaled softmax, explain top-k vs. top-p in one sentence each, and name a task where greedy/beam beats sampling. Check against Stage 3.