What does the KV cache store in a transformer?

The keys and values of all previously generated tokens. In a causal decoder those don't change as new tokens are produced, so they're cached and reused instead of recomputed.

How does the KV cache speed up LLM inference?

It changes each decoding step from recomputing keys/values over the entire prefix to computing only the new token's projections and attending against the cached keys and values.

How much memory does the KV cache use?

About 2 · n_layers · n_heads · d_head · seq_len · bytes_per_element for each sequence (the 2 is for keys and values, and n_heads counts the key/value heads). It grows linearly with sequence length and batch size and can dominate memory for long contexts.

What are MQA and GQA?

Multi-Query Attention shares a single key/value head across all query heads; Grouped-Query Attention shares a few. Both shrink the KV cache and memory bandwidth with minimal quality loss.

KV Cache Explained (LLM Inference Interview)

How the KV cache makes autoregressive LLM decoding affordable — what it stores and why reuse is valid, the memory cost, why decoding is memory-bandwidth-bound, and how MQA/GQA shrink it — with code.

1. Big Picture

The KV cache is the optimization that makes autoregressive LLM generation affordable. When a decoder generates token by token, each new token attends to all previous tokens. Naively, every step recomputes the keys and values for the entire sequence so far — an enormous amount of repeated work. The insight: the keys and values of past tokens don't change as you generate more tokens, so you compute them once and cache them. Each new step then only computes the query, key, and value for the single new token and attends to the stored cache.

This turns the per-step cost from "recompute attention over the whole prefix" into "one new token against a growing cache." It's the difference between generation being quadratically wasteful and being practical. The frame to hold: cache past K and V; at each step compute only the new token's Q/K/V, append, and attend. Interviewers care that you know what is cached, why it's safe, the memory cost, and why decoding becomes memory-bandwidth-bound.

2. Intuition + Visual

In a causal decoder, token $t$ attends to tokens $1..t$. When you later generate token $t+1$, tokens $1..t$ are unchanged, so their keys and values are identical to what you already computed. Recomputing them is pure waste. The KV cache stores them; generation appends the new token's K and V and reuses everything else.
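The claim that prefix keys and values stay identical can be checked directly. The sketch below is a minimal illustration for a single layer, assuming hypothetical projection matrices `W_k` and `W_v` and random hidden states; none of these names come from a specific model. Because the projections act on each row independently, appending a token leaves the earlier rows untouched. In a multi-layer causal decoder the same holds at every layer, since the causal mask keeps prefix hidden states independent of later tokens.

```python
import torch

torch.manual_seed(0)
d_model = 16

# Hypothetical key/value projection weights for one attention head.
W_k = torch.randn(d_model, d_model)
W_v = torch.randn(d_model, d_model)

prefix = torch.randn(5, d_model)        # hidden states of tokens 1..t
new_token = torch.randn(1, d_model)     # hidden state of token t+1
extended = torch.cat([prefix, new_token], dim=0)

# K and V computed from the prefix alone, then from the extended sequence.
K_prefix, V_prefix = prefix @ W_k, prefix @ W_v
K_extended, V_extended = extended @ W_k, extended @ W_v

# The first 5 rows of the extended K/V match the prefix-only K/V
# (up to floating-point tolerance), so they can be cached and reused.
prefix_keys_unchanged = torch.allclose(K_extended[:5], K_prefix, atol=1e-5)
prefix_values_unchanged = torch.allclose(V_extended[:5], V_prefix, atol=1e-5)
print(prefix_keys_unchanged, prefix_values_unchanged)
```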

The cost moves from compute to memory: the cache grows linearly with sequence length, and each step reads the entire cache while doing only a few arithmetic operations per value read, so decoding is limited by memory bandwidth rather than arithmetic.
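A rough back-of-the-envelope estimate makes the bandwidth argument concrete. The sketch below counts only the attention over the cache for one head at one decode step, assuming 2 FLOPs per multiply-add and that K and V are each read once; it ignores the softmax and the model weights, and the function name is illustrative.

```python
def attention_decode_intensity(seq_len: int, d_head: int, bytes_per_element: int) -> float:
    """FLOPs per byte read when one head attends to its cache for one new token."""
    # q @ K^T: seq_len dot products of length d_head, 2 FLOPs per multiply-add.
    score_flops = 2 * seq_len * d_head
    # weights @ V: another seq_len * d_head multiply-adds.
    value_flops = 2 * seq_len * d_head
    # K and V (seq_len x d_head each) are both read once from memory.
    bytes_read = 2 * seq_len * d_head * bytes_per_element
    return (score_flops + value_flops) / bytes_read


intensity_fp16 = attention_decode_intensity(seq_len=4096, d_head=128, bytes_per_element=2)
print(intensity_fp16)  # 1.0
```

Under these assumptions the ratio is about one FLOP per byte at 16-bit precision, regardless of sequence length. Accelerators can generally perform many more FLOPs than that per byte of memory bandwidth, which is why the cache read, not the arithmetic, tends to set the pace.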

3. The Math

Without a cache, step $t$ re-projects K and V for all $t$ positions. Each projection multiplies a $d$-dimensional hidden state by a $d \times d$ weight matrix, so that is roughly $O(t \cdot d^2)$ projection work per step and $O(n^2 d^2)$ over $n$ generated tokens: a quadratic blow-up in sequence length, spent recomputing values that were already known.

With a cache, step $t$ computes only the new token's projections ($O(d^2)$) and attends against the cached $K, V$ ($O(t \cdot d)$). The redundant re-projection of the prefix is gone.
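To see the gap in numbers, the sketch below simply counts how many token projections are performed when generating `n` tokens, with and without a cache. The function names are illustrative, and the count is per layer and per projection matrix.

```python
def projected_tokens_without_cache(n_tokens: int) -> int:
    """Token projections when every step re-projects the whole prefix."""
    return sum(range(1, n_tokens + 1))   # 1 + 2 + ... + n = n(n+1)/2


def projected_tokens_with_cache(n_tokens: int) -> int:
    """Token projections when each step projects only the new token."""
    return n_tokens


n = 1000
print(projected_tokens_without_cache(n))  # 500500
print(projected_tokens_with_cache(n))     # 1000
```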

Memory cost of the cache for one sequence:

$$
\text{KV memory} = 2 \times n_{\text{layers}} \times n_{\text{heads}} \times d_{\text{head}} \times \text{seq\_len} \times \text{bytes}
$$

The factor of 2 is for K and V. This grows linearly with sequence length and batch size and quickly dominates memory for long contexts. Two standard mitigations shrink it by sharing keys/values across query heads: Multi-Query Attention (MQA) uses a single K/V head for all query heads; Grouped-Query Attention (GQA) uses a few — cutting cache size (and bandwidth) by the head-sharing factor with minimal quality loss.
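The formula is easy to turn into a small calculator. The sketch below uses a hypothetical configuration (32 layers, 32 query heads, head dimension 128, 16-bit values, 4096 tokens) chosen only for illustration. The parameter `n_kv_heads` is the number of key/value heads: equal to the number of query heads for standard multi-head attention, smaller for GQA, and 1 for MQA.

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    d_head: int,
    seq_len: int,
    bytes_per_element: int,
    batch_size: int = 1,
) -> int:
    """KV cache size in bytes; the leading 2 accounts for keys and values."""
    return 2 * n_layers * n_kv_heads * d_head * seq_len * bytes_per_element * batch_size


GIB = 1024**3
# Hypothetical model configuration, for illustration only.
config = dict(n_layers=32, d_head=128, seq_len=4096, bytes_per_element=2)

mha_bytes = kv_cache_bytes(n_kv_heads=32, **config)  # one K/V head per query head
gqa_bytes = kv_cache_bytes(n_kv_heads=8, **config)   # 4 query heads share each K/V head
mqa_bytes = kv_cache_bytes(n_kv_heads=1, **config)   # all query heads share one K/V head

print(mha_bytes / GIB)  # 2.0
print(gqa_bytes / GIB)  # 0.5
print(mqa_bytes / GIB)  # 0.0625
```

With these assumed numbers, a single 4096-token sequence needs 2 GiB of cache under full multi-head attention, and the cache shrinks in direct proportion to the number of K/V heads kept.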

4. Implementation

```python
import torch
from torch import Tensor


class KVCacheAttention:
    """Single-head causal attention with an incremental KV cache."""

    def __init__(self) -> None:
        self.k_cache: Tensor | None = None      # (seq, d)
        self.v_cache: Tensor | None = None

    def step(self, q_t: Tensor, k_t: Tensor, v_t: Tensor) -> Tensor:
        # q_t, k_t, v_t: (1, d) for the single new token
        self.k_cache = k_t if self.k_cache is None else torch.cat([self.k_cache, k_t], dim=0)
        self.v_cache = v_t if self.v_cache is None else torch.cat([self.v_cache, v_t], dim=0)
        d = q_t.size(-1)
        scores = (q_t @ self.k_cache.T) / d**0.5     # (1, seq_so_far): no future to mask
        weights = torch.softmax(scores, dim=-1)
        return weights @ self.v_cache                # (1, d)


attn = KVCacheAttention()
d = 64
for _ in range(5):                                   # generate 5 tokens
    out = attn.step(torch.randn(1, d), torch.randn(1, d), torch.randn(1, d))
assert attn.k_cache.shape == (5, d) and out.shape == (1, d)
```
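One way to convince yourself the cached path is correct is to compare it with ordinary masked attention over the full sequence. The sketch below is a self-contained check, not part of the listing above: `full_causal_attention` and `cached_decode` are illustrative helper names, and the inputs are random tensors standing in for already-projected Q, K, and V.

```python
import torch
from torch import Tensor


def full_causal_attention(q: Tensor, k: Tensor, v: Tensor) -> Tensor:
    """Attention over the whole sequence at once, with a causal mask."""
    n, d = q.shape
    scores = (q @ k.T) / d**0.5
    # True above the diagonal marks future positions that must be hidden.
    future = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(future, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v


def cached_decode(q: Tensor, k: Tensor, v: Tensor) -> Tensor:
    """Token-by-token attention that appends to a growing K/V cache."""
    n, d = q.shape
    k_cache = torch.empty(0, d)
    v_cache = torch.empty(0, d)
    outputs = []
    for t in range(n):
        k_cache = torch.cat([k_cache, k[t:t + 1]], dim=0)
        v_cache = torch.cat([v_cache, v[t:t + 1]], dim=0)
        scores = (q[t:t + 1] @ k_cache.T) / d**0.5   # only past and current keys exist
        outputs.append(torch.softmax(scores, dim=-1) @ v_cache)
    return torch.cat(outputs, dim=0)


torch.manual_seed(0)
q, k, v = torch.randn(6, 32), torch.randn(6, 32), torch.randn(6, 32)
outputs_match = torch.allclose(
    full_causal_attention(q, k, v), cached_decode(q, k, v), atol=1e-5
)
print(outputs_match)
```

The cached version never needs a mask because the cache only ever contains positions up to the current token.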

5. Interview Questions

Conceptual: What does the KV cache store and why is caching it correct? (Past tokens' keys and values — they don't change as new tokens are generated in a causal decoder, so they can be reused.)
Implementation: How does the cache change the per-step cost of decoding? (From recomputing K/V over the whole prefix each step to computing only the new token's projections plus attention over the cache.)
Applied: Write the memory cost of the cache and name what makes it grow. (2 · layers · heads · d_head · seq_len · bytes — grows linearly with sequence length and batch size.)
Systems-level: Why is autoregressive decoding memory-bandwidth-bound rather than compute-bound? (Each step does little arithmetic but must read the entire growing cache from memory — bandwidth, not FLOPs, is the bottleneck.)
Failure modes: How do MQA and GQA reduce the KV cache? (They share K/V heads across query heads — MQA uses one K/V head, GQA a few — shrinking cache size and bandwidth with little quality loss.)
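For the MQA/GQA question, it can help to have a concrete picture of head sharing. The following is a minimal sketch of one decode step under assumed shapes: `gqa_decode_step` is an illustrative name, the cache holds only `n_kv_heads` heads, and each K/V head serves a group of consecutive query heads. Expanding the cache with `repeat_interleave` is done here for clarity; an optimized kernel would typically avoid materializing the copies.

```python
import torch
from torch import Tensor


def gqa_decode_step(q_t: Tensor, k_cache: Tensor, v_cache: Tensor) -> Tensor:
    """One decode step of grouped-query attention.

    q_t:              (n_q_heads, d_head) queries for the new token
    k_cache, v_cache: (n_kv_heads, seq, d_head) cached keys and values
    """
    n_q_heads, d_head = q_t.shape
    n_kv_heads = k_cache.size(0)
    group_size = n_q_heads // n_kv_heads
    # Each K/V head is shared by `group_size` consecutive query heads.
    k = k_cache.repeat_interleave(group_size, dim=0)    # (n_q_heads, seq, d_head)
    v = v_cache.repeat_interleave(group_size, dim=0)
    scores = torch.einsum("hd,hsd->hs", q_t, k) / d_head**0.5
    weights = torch.softmax(scores, dim=-1)
    return torch.einsum("hs,hsd->hd", weights, v)       # (n_q_heads, d_head)


torch.manual_seed(0)
n_q_heads, n_kv_heads, seq_len, d_head = 8, 2, 10, 16
q_t = torch.randn(n_q_heads, d_head)
k_cache = torch.randn(n_kv_heads, seq_len, d_head)
v_cache = torch.randn(n_kv_heads, seq_len, d_head)

out = gqa_decode_step(q_t, k_cache, v_cache)
print(out.shape)  # torch.Size([8, 16])
```

Setting `n_kv_heads` to 1 gives MQA, and setting it equal to `n_q_heads` recovers standard multi-head attention; only the stored cache size changes.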

6. Retrieval Check

Without looking: explain what the KV cache stores and why reuse is valid, give the memory-cost formula, and say why decoding is memory-bound. Then name two techniques that shrink the cache. Check against Stages 1–3.
