LLMs
KV Cache
How the KV cache makes autoregressive LLM decoding affordable — what it stores and why reuse is valid, the memory cost, why decoding is memory-bandwidth-bound, and how MQA/GQA shrink it — with code.
The KV cache is the optimization that makes autoregressive LLM generation affordable. When a decoder generates token by token, each new token attends to all previous tokens. Naively, every step recomputes the keys and values for the entire sequence so far — an enormous amount of repeated work. The insight: the keys and values of past tokens don't change as you generate more tokens, so you compute them once and cache them. Each new step then only computes the query, key, and value for the single new token and attends to the stored cache.
This turns the per-step cost from "recompute attention over the whole prefix" into "one new token against a growing cache." It's the difference between generation being quadratically wasteful and being practical. The frame to hold: cache past K and V; at each step compute only the new token's Q/K/V, append, and attend. Interviewers care that you know what is cached, why it's safe, the memory cost, and why decoding becomes memory-bandwidth-bound.
In a causal decoder, token attends to tokens . When you later generate token , tokens are unchanged, so their keys and values are identical to what you already computed. Recomputing them is pure waste. The KV cache stores them; generation appends the new token's K and V and reuses everything else.
flowchart LR
T["new token t"] --> Q["compute q_t, k_t, v_t"]
Q --> C["append k_t, v_t to KV cache"]
C --> A["attention: q_t over cached K, V (all past)"]
A --> O["output token t"]
O --> T
The cost moves from compute to memory: the cache grows linearly with sequence length, and at each step you read the entire cache, so decoding is limited by memory bandwidth, not arithmetic.
Without a cache, generating step recomputes K and V for all positions: roughly projection work per step, so generating tokens costs — the same quadratic blow-up as full attention, repeated.
With a cache, step computes only the new token's projections () and attends against the cached (). The redundant re-projection of the prefix is gone.
Memory cost of the cache for one sequence:
The factor of 2 is for K and V. This grows linearly with sequence length and batch size and quickly dominates memory for long contexts. Two standard mitigations shrink it by sharing keys/values across query heads: Multi-Query Attention (MQA) uses a single K/V head for all query heads; Grouped-Query Attention (GQA) uses a few — cutting cache size (and bandwidth) by the head-sharing factor with minimal quality loss.
1import torch
2from torch import Tensor
3
4
5class KVCacheAttention:
6 """Single-head causal attention with an incremental KV cache."""
7
8 def __init__(self) -> None:
9 self.k_cache: Tensor | None = None # (seq, d)
10 self.v_cache: Tensor | None = None
11
12 def step(self, q_t: Tensor, k_t: Tensor, v_t: Tensor) -> Tensor:
13 # q_t, k_t, v_t: (1, d) for the single new token
14 self.k_cache = k_t if self.k_cache is None else torch.cat([self.k_cache, k_t], dim=0)
15 self.v_cache = v_t if self.v_cache is None else torch.cat([self.v_cache, v_t], dim=0)
16 d = q_t.size(-1)
17 scores = (q_t @ self.k_cache.T) / d**0.5 # (1, seq_so_far) — no future to mask
18 weights = torch.softmax(scores, dim=-1)
19 return weights @ self.v_cache # (1, d)
20
21
22attn = KVCacheAttention()
23d = 64
24for _ in range(5): # generate 5 tokens
25 out = attn.step(torch.randn(1, d), torch.randn(1, d), torch.randn(1, d))
26assert attn.k_cache.shape == (5, d) and out.shape == (1, d)- Conceptual: What does the KV cache store and why is caching it correct? (Past tokens' keys and values — they don't change as new tokens are generated in a causal decoder, so they can be reused.)
- Implementation: How does the cache change the per-step cost of decoding? (From recomputing K/V over the whole prefix each step to computing only the new token's projections plus attention over the cache.)
- Applied: Write the memory cost of the cache and name what makes it grow. (2 · layers · heads · d_head · seq_len · bytes — grows linearly with sequence length and batch size.)
- Systems-level: Why is autoregressive decoding memory-bandwidth-bound rather than compute-bound? (Each step does little arithmetic but must read the entire growing cache from memory — bandwidth, not FLOPs, is the bottleneck.)
- Failure modes: How do MQA and GQA reduce the KV cache? (They share K/V heads across query heads — MQA uses one K/V head, GQA a few — shrinking cache size and bandwidth with little quality loss.)
Without looking: explain what the KV cache stores and why reuse is valid, give the memory-cost formula, and say why decoding is memory-bound. Then name two techniques that shrink the cache. Check against Stages 1–3.
This is one static walkthrough. A live session goes further.
Ask follow-ups at interview depth, get the math and code rendered as you go, and run a retrieval drill until it sticks — then come back to the thread anytime.
Related concepts
Deep Learning
Attention Mechanisms
How scaled dot-product and multi-head attention work — the soft key-value lookup at the heart of every Transformer — with the math, runnable PyTorch, and calibrated interview questions.
Deep Learning
Transformer Architecture
The Transformer block from the ground up — self-attention plus a position-wise feed-forward network, residuals and LayerNorm, and the encoder/decoder configurations — with the math, PyTorch, and calibrated interview questions.
LLMs
LoRA and PEFT
LoRA (Low-Rank Adaptation) explained — freeze the base model and train a tiny low-rank ΔW = BA, why it works, the parameter savings, QLoRA, and zero-latency merging — with code.