What does temperature do in LLM sampling?

It scales the logits before softmax: low temperature sharpens the distribution toward greedy (most likely token), high temperature flattens it toward more random, diverse output.

What is the difference between top-k and top-p sampling?

Top-k keeps a fixed number of highest-probability candidates; top-p (nucleus) keeps the smallest set whose cumulative probability reaches p — a dynamic count that adapts to the model's confidence.

When should you use greedy or beam search instead of sampling?

For low-entropy, single-correct-answer tasks like translation or summarization, where you want the most probable sequence rather than diversity.

What does temperature = 0 do?

It produces deterministic greedy decoding — reproducible outputs, useful for evaluations, tests, and tasks that need consistency.

Temperature & Top-p Sampling Explained (ML Interview)

How LLMs choose the next token — temperature, top-k, top-p (nucleus), and beam search — the math, the tradeoffs, and a runnable PyTorch sampler.

1. Big Picture

At each step, a language model outputs a probability distribution over the whole vocabulary; a decoding strategy decides which token to actually emit. The choices trade off coherence against diversity. Greedy always takes the argmax — safe but repetitive. Temperature reshapes the distribution before sampling. Top-k and top-p (nucleus) truncate the distribution to its most plausible tokens before sampling. Beam search keeps several high-probability sequences alive.
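As a rough illustration of the coherence-versus-diversity tradeoff, the sketch below contrasts greedy selection with plain sampling. The four-token vocabulary and its probabilities are made up for the example, and `greedy` and `sample_token` are hypothetical helper names, not part of any library.

```python
import random

# Hypothetical next-token distribution over a 4-token vocabulary.
vocab = ["the", "a", "one", "this"]
probs = [0.5, 0.3, 0.15, 0.05]

def greedy(probs):
    """Index of the most probable token (argmax)."""
    return max(range(len(probs)), key=lambda i: probs[i])

def sample_token(probs, rng):
    """Draw one index in proportion to its probability."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

rng = random.Random(0)                      # seeded so the run is repeatable
greedy_picks = [vocab[greedy(probs)] for _ in range(10)]
sampled_picks = [vocab[sample_token(probs, rng)] for _ in range(10)]
print(greedy_picks)    # the same token every time
print(sampled_picks)   # a mix, weighted toward the likelier tokens
```

Greedy never leaves the peak of the distribution, while sampling visits the tail in proportion to its mass. Temperature, top-k, and top-p all adjust how much of that tail stays reachable.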

Getting these right is the difference between bland, looping text and fluent, varied generation. The frame to hold: the model gives you a distribution; decoding controls how much you trust the peak vs. explore the tail. Interviewers expect you to explain what temperature does to the logits, the difference between top-k and top-p, and when greedy/beam beats sampling.

2. Intuition + Visual

Temperature scales the logits before softmax. Low temperature ($T \to 0$) sharpens the distribution toward the most likely token (approaching greedy); high temperature ($T > 1$) flattens it, giving rarer tokens a real chance, which makes output more creative but more error-prone. Top-k and top-p then truncate: top-k keeps the k highest-probability tokens; top-p keeps the smallest set whose probabilities sum to at least p, adapting how many candidates survive to how confident the model is.

3. The Math

Given logits $z$, temperature $T$ rescales them before softmax:

$$p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$

$T = 1$ is the raw distribution; $T \to 0$ concentrates all mass on the argmax (greedy); $T > 1$ flattens the distribution, approaching uniform as $T \to \infty$.
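To see the formula above act on concrete numbers, here is a minimal pure-Python sketch. The three toy logits are an assumption chosen for illustration, and `softmax_with_temperature` is a hypothetical helper name.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Return softmax(logits / temperature) as a list of floats (temperature > 0)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)                      # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.0]                # hypothetical 3-token vocabulary
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

On these logits the top token holds roughly 0.87 of the mass at $T = 0.5$, 0.67 at $T = 1$, and 0.51 at $T = 2$, so lowering $T$ sharpens the peak and raising it spreads mass to the other tokens.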

Top-k keeps the set $V_k$ of the $k$ highest-probability tokens, zeros the rest, and renormalizes. Top-p (nucleus) keeps the smallest set $V_p$ such that

$$\sum_{i \in V_p} p_i \ge p$$

then renormalizes and samples from it. The key difference: top-k uses a fixed candidate count regardless of confidence; top-p uses a dynamic count — few candidates when the model is sure, many when it's uncertain. Beam search isn't sampling at all: it maintains the $b$ highest-probability partial sequences at each step, good for low-entropy tasks like translation but prone to dull, repetitive text in open-ended generation.
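The fixed-versus-dynamic candidate count can be checked on two hand-made distributions. This is a small sketch, assuming illustrative probabilities and hypothetical helper names `top_k_keep` and `top_p_keep`; it only selects the surviving token indices and leaves renormalization out.

```python
def top_k_keep(probs, k):
    """Indices of the k highest-probability tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return order[:k]

def top_p_keep(probs, p):
    """Indices of the smallest highest-probability set whose mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:                       # stop as soon as the nucleus is full
            break
    return kept

confident = [0.95, 0.02, 0.01, 0.01, 0.01]  # model is sure of one token
uncertain = [0.25, 0.22, 0.20, 0.18, 0.15]  # mass is spread out

for name, dist in (("confident", confident), ("uncertain", uncertain)):
    print(name, "top-k=3 keeps", len(top_k_keep(dist, 3)),
          "| top-p=0.9 keeps", len(top_p_keep(dist, 0.9)))
```

With $k = 3$, top-k keeps three candidates in both cases. With $p = 0.9$, top-p keeps a single token for the confident distribution and all five for the uncertain one.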

4. Implementation

```python
import torch
import torch.nn.functional as F


def sample(logits: torch.Tensor, temperature: float = 1.0,
           top_k: int = 0, top_p: float = 0.0) -> int:
    """Sample one token id from a 1-D logits vector.

    top_k=0 and top_p=0.0 disable the respective filter.
    """
    if temperature <= 0:                                # T = 0: deterministic greedy decoding
        return int(torch.argmax(logits, dim=-1).item())
    logits = logits / temperature                       # temperature scaling
    if top_k > 0:                                       # keep the k highest logits
        k = min(top_k, logits.size(-1))
        kth = torch.topk(logits, k).values[..., -1, None]
        logits = logits.masked_fill(logits < kth, float("-inf"))
    probs = F.softmax(logits, dim=-1)
    if 0.0 < top_p < 1.0:                               # nucleus: smallest set with mass >= p
        sorted_probs, idx = torch.sort(probs, descending=True)
        # Mass accumulated *before* each token; a token is dropped only if the
        # tokens ahead of it already reach top_p, so the top token always stays.
        mass_before = torch.cumsum(sorted_probs, dim=-1) - sorted_probs
        sorted_probs = sorted_probs.masked_fill(mass_before >= top_p, 0.0)
        probs = torch.zeros_like(probs).scatter(-1, idx, sorted_probs)
        probs = probs / probs.sum(dim=-1, keepdim=True)  # renormalize
    return int(torch.multinomial(probs, 1).item())


logits = torch.randn(50000)
token = sample(logits, temperature=0.8, top_p=0.9)
```

5. Interview Questions

Conceptual: What does temperature do to the next-token distribution? (Scales logits before softmax — low T sharpens toward greedy, high T flattens toward uniform/more random.)
Implementation: What's the difference between top-k and top-p sampling? (Top-k keeps a fixed number of candidates; top-p keeps a dynamic number whose cumulative probability reaches p — adapting to the model's confidence.)
Applied: When would you prefer greedy or beam search over sampling? (Low-entropy, single-correct-answer tasks like translation or summarization, where you want the most probable sequence rather than diversity.)
Systems-level: What does temperature = 0 give you, and why is it useful? (Deterministic greedy decoding — reproducible outputs, useful for evals, tests, and tasks needing consistency.)
Failure modes: Why can beam search produce dull or repetitive text in open-ended generation? (It optimizes for high total probability, which favors safe, generic continuations and degenerate repetition over natural diversity.)
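To make the beam search questions concrete, the sketch below runs a two-step search over a toy model. `TOY_MODEL` and its probabilities are invented for illustration, and `beam_search` is a hypothetical helper; a beam width of 1 reduces to greedy decoding.

```python
import math

# Hypothetical toy model: next-token probabilities given the prefix so far.
TOY_MODEL = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.5, "y": 0.5},
    ("B",): {"x": 0.9, "y": 0.1},
}

def beam_search(model, beam_width, steps):
    """Return the best (sequence, log_prob) found with the given beam width."""
    beams = [((), 0.0)]                     # (prefix, cumulative log-probability)
    for _ in range(steps):
        candidates = []
        for prefix, score in beams:
            for token, prob in model[prefix].items():
                candidates.append((prefix + (token,), score + math.log(prob)))
        # Keep only the beam_width highest-scoring partial sequences.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]

greedy_seq, greedy_lp = beam_search(TOY_MODEL, beam_width=1, steps=2)
beam_seq, beam_lp = beam_search(TOY_MODEL, beam_width=2, steps=2)
print(greedy_seq, round(math.exp(greedy_lp), 2))
print(beam_seq, round(math.exp(beam_lp), 2))
```

Greedy commits to "A" and ends with a sequence of probability 0.30, while a beam of width 2 keeps "B" alive and finds ("B", "x") at 0.36. The same objective explains the failure mode: the search is rewarded only for total probability, not for variety.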

6. Retrieval Check

From memory: write the temperature-scaled softmax, explain top-k vs. top-p in one sentence each, and name a task where greedy/beam beats sampling. Check against Stage 3.
