Temperature, Top-k & Top-p Sampling

How LLMs choose the next token — temperature, top-k, top-p (nucleus), and beam search — the math, the tradeoffs, and a runnable PyTorch sampler.

7 min read · Reviewed May 2026

1. Big Picture

At each step, a language model outputs a probability distribution over the whole vocabulary; a decoding strategy decides which token to actually emit. The choices trade off coherence against diversity. Greedy always takes the argmax — safe but repetitive. Temperature reshapes the distribution before sampling. Top-k and top-p (nucleus) truncate the distribution to its most plausible tokens before sampling. Beam search keeps several high-probability sequences alive.
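
The contrast between greedy decoding and sampling can be sketched on a toy distribution. This is a minimal illustration, assuming a made-up four-token vocabulary; the probabilities in `probs` are invented for the example and do not come from a real model.

```python
import torch

# Toy next-token distribution over a 4-token vocabulary (illustrative values).
probs = torch.tensor([0.50, 0.30, 0.15, 0.05])

# Greedy decoding: always pick the single most probable token.
greedy_token = int(torch.argmax(probs))

# Sampling: draw tokens in proportion to their probability.
generator = torch.Generator().manual_seed(0)  # seeded only for repeatability
samples = torch.multinomial(probs, num_samples=1000, replacement=True, generator=generator)
counts = torch.bincount(samples, minlength=probs.numel())

print(greedy_token)            # index of the argmax
print(counts / counts.sum())   # empirical frequencies, close to probs
```

Greedy returns the same index every time, while the sampled frequencies track the whole distribution, including the low-probability tail.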

Getting these right is the difference between bland, looping text and fluent, varied generation. The frame to hold: the model gives you a distribution; decoding controls how much you trust the peak vs. explore the tail. Interviewers expect you to explain what temperature does to the logits, the difference between top-k and top-p, and when greedy/beam beats sampling.

2. Intuition + Visual

Temperature scales the logits before softmax. Low temperature ($T \to 0$) sharpens the distribution toward the most likely token (approaching greedy); high temperature ($T > 1$) flattens it, giving rarer tokens a real chance, which makes output more creative but more error-prone. Top-k and top-p then truncate: top-k keeps the $k$ highest-probability tokens; top-p keeps the smallest set whose probabilities sum to at least $p$, adapting how many candidates survive to how confident the model is.

flowchart LR
    L["Logits"] --> T["divide by temperature T"]
    T --> S["softmax -> probabilities"]
    S --> F["truncate: top-k or top-p"]
    F --> P["renormalize and sample"]
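
The first two boxes of the flowchart can be tried on a small example. The sketch below assumes three illustrative logits and a hypothetical helper named `temperature_probs`; it only shows how dividing by the temperature changes the softmax output.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.0])  # illustrative logits, not from a real model

def temperature_probs(logits: torch.Tensor, temperature: float) -> torch.Tensor:
    """Divide logits by the temperature, then apply softmax."""
    return F.softmax(logits / temperature, dim=-1)

for t in (0.5, 1.0, 2.0):
    p = temperature_probs(logits, t)
    print(f"T={t}: top-token probability = {p.max().item():.3f}")
```

For these logits the top token's probability drops from roughly 0.87 at T=0.5 to roughly 0.67 at T=1 and roughly 0.51 at T=2, which is the sharpening and flattening described above.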

3. The Math

Given logits $z$, temperature $T$ rescales before softmax:

$$p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$

$T = 1$ is the raw distribution; $T \to 0$ concentrates all mass on the argmax (greedy); $T > 1$ flattens toward uniform.
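
One way to see why these limits hold is to look at the ratio of two probabilities under the temperature-scaled softmax. The short derivation below uses only the symbols already defined; the shared denominator cancels:

$$
\frac{p_i}{p_j} = \frac{\exp(z_i / T)}{\exp(z_j / T)} = \exp\!\left(\frac{z_i - z_j}{T}\right)
$$

If $z_i > z_j$, the exponent grows without bound as $T \to 0^{+}$, so $p_i / p_j \to \infty$ and every token other than the one with the largest logit loses its mass (tied maxima split it equally). As $T \to \infty$, the exponent goes to $0$ for every pair, so $p_i / p_j \to 1$ and the distribution approaches uniform.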

Top-k keeps the set $V_k$ of the $k$ highest-probability tokens, zeros the rest, and renormalizes. Top-p (nucleus) keeps the smallest set $V_p$ such that

$$\sum_{i \in V_p} p_i \ge p$$

then renormalizes and samples from it. The key difference: top-k uses a fixed candidate count regardless of confidence; top-p uses a dynamic count (few candidates when the model is sure, many when it's uncertain). Beam search isn't sampling at all: it maintains the $b$ highest-probability partial sequences at each step, good for low-entropy tasks like translation but prone to dull, repetitive text in open-ended generation.
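
The fixed-versus-dynamic distinction can be checked by counting how many tokens the nucleus keeps. This is a sketch under stated assumptions: `nucleus_size` is a hypothetical helper written for this example, and both distributions are invented to represent a confident and an uncertain model.

```python
import torch

def nucleus_size(probs: torch.Tensor, p: float) -> int:
    """Size of the smallest set of top tokens whose probabilities sum to at least p."""
    sorted_probs, _ = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # A token belongs to the nucleus if the mass before it is still below p.
    return int(((cumulative - sorted_probs) < p).sum())

confident = torch.tensor([0.92, 0.05, 0.02, 0.01])  # peaked distribution
uncertain = torch.tensor([0.30, 0.25, 0.25, 0.20])  # flat distribution

print(nucleus_size(confident, 0.9))  # 1
print(nucleus_size(uncertain, 0.9))  # 4
```

With top-k and $k = 2$, both distributions would keep exactly two candidates; with $p = 0.9$ the nucleus shrinks to one token in the confident case and expands to all four in the uncertain one.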

4. Implementation
```python
import torch
import torch.nn.functional as F

def sample(logits: torch.Tensor, temperature=1.0, top_k=0, top_p=0.0) -> int:
    """Sample one token id from a 1-D logits vector."""
    logits = logits / max(temperature, 1e-8)            # temperature scaling
    if top_k > 0:                                       # keep k highest logits
        k = min(top_k, logits.size(-1))                 # guard against k > vocab size
        kth = torch.topk(logits, k).values[..., -1, None]
        logits = logits.masked_fill(logits < kth, float("-inf"))
    probs = F.softmax(logits, dim=-1)
    if 0.0 < top_p < 1.0:                               # nucleus: smallest set with mass >= p
        sorted_probs, idx = torch.sort(probs, descending=True)
        cumulative = torch.cumsum(sorted_probs, dim=-1)
        # Drop a token only if the mass before it already reaches p, so the
        # token that crosses the threshold is kept (and the top token always is).
        cutoff = (cumulative - sorted_probs) >= top_p
        sorted_probs = sorted_probs.masked_fill(cutoff, 0.0)
        probs = torch.zeros_like(probs).scatter(-1, idx, sorted_probs)
        probs = probs / probs.sum(dim=-1, keepdim=True)  # renormalize
    return torch.multinomial(probs, 1).item()

logits = torch.randn(50000)
token = sample(logits, temperature=0.8, top_p=0.9)
```

5. Interview Questions
  1. Conceptual: What does temperature do to the next-token distribution? (Scales logits before softmax — low T sharpens toward greedy, high T flattens toward uniform/more random.)
  2. Implementation: What's the difference between top-k and top-p sampling? (Top-k keeps a fixed number of candidates; top-p keeps a dynamic number whose cumulative probability reaches p — adapting to the model's confidence.)
  3. Applied: When would you prefer greedy or beam search over sampling? (Low-entropy, single-correct-answer tasks like translation or summarization, where you want the most probable sequence rather than diversity.)
  4. Systems-level: What does temperature = 0 give you, and why is it useful? (Deterministic greedy decoding — reproducible outputs, useful for evals, tests, and tasks needing consistency.)
  5. Failure modes: Why can beam search produce dull or repetitive text in open-ended generation? (It optimizes for high total probability, which favors safe, generic continuations and degenerate repetition over natural diversity.)
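
The beam-search behavior behind questions 3 and 5 can be made concrete with a toy example. This is a minimal sketch under loud assumptions: the `transition` bigram table is a hypothetical stand-in for a real model, its values are invented, and `beam_search` is an illustrative helper rather than a library function.

```python
import torch

# Hypothetical toy "model": a fixed bigram table over a 3-token vocabulary.
# Row i holds the next-token probabilities after token i (illustrative values).
transition = torch.tensor([
    [0.1, 0.6, 0.3],
    [0.5, 0.1, 0.4],
    [0.3, 0.3, 0.4],
])
log_transition = transition.log()

def beam_search(start: int, steps: int, beam_width: int) -> list[tuple[list[int], float]]:
    """Keep the beam_width highest log-probability sequences at each step."""
    beams = [([start], 0.0)]  # (token sequence, cumulative log-probability)
    for _ in range(steps):
        candidates = []
        for tokens, score in beams:
            next_log_probs = log_transition[tokens[-1]]
            for token, log_p in enumerate(next_log_probs.tolist()):
                candidates.append((tokens + [token], score + log_p))
        # Prune: sort all expansions by score and keep only the best beam_width.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:beam_width]
    return beams

best_tokens, best_score = beam_search(start=0, steps=3, beam_width=2)[0]
print(best_tokens, round(best_score, 3))
```

On this toy table the best beam is the alternating sequence 0, 1, 0, 1: maximizing total probability settles into a loop, a small-scale instance of the repetition described in question 5.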

6. Retrieval Check

From memory: write the temperature-scaled softmax, explain top-k vs. top-p in one sentence each, and name a task where greedy/beam beats sampling. Check against Stage 3.
