LLMs
Tokenization & BPE
How text becomes tokens — Byte-Pair Encoding, why subword beats word- and character-level, the training algorithm, and why token count drives context and cost — with code.
Before a language model sees text, the text must become a sequence of integer tokens. Tokenization decides what those units are. Word-level vocabularies explode in size and choke on rare or unseen words (out-of-vocabulary); character-level sequences are tiny in vocabulary but very long. Byte-Pair Encoding (BPE) is the subword compromise that won: it learns a vocabulary of frequent character chunks, so common words become single tokens while rare words split into meaningful pieces — no OOV, manageable vocabulary, reasonable sequence length.
Every modern LLM tokenizes with a BPE variant (GPT uses byte-level BPE; BERT uses the closely related WordPiece). The frame to hold: BPE starts from characters and greedily merges the most frequent adjacent pair, over and over, until it hits a target vocabulary size. Interviewers care that you understand why subword, the vocab-size/sequence-length tradeoff, and that token count — not word count — drives context limits and API cost.
Training: look at a big corpus as sequences of characters. Find the most frequent adjacent pair (say t + h → th), merge it into a new symbol, and add it to the vocabulary. Repeat. Frequent sequences like ing, tion, or whole common words get absorbed into single tokens; rare words stay split into subwords.
flowchart LR
A["l o w e r"] --> B["merge 'e r' -> 'er'"]
B --> C["l o w er"]
C --> D["merge 'l o' -> 'lo'"]
D --> E["lo w er"]
Encoding new text just replays the learned merges in the order they were learned. The result: "tokenization" might be one token, while "antidisestablishmentarianism" becomes several subwords — but nothing is ever unrepresentable.
BPE is an algorithm rather than a closed-form equation. Training to a target vocabulary size :
- Initialize the vocabulary with all base symbols (characters, or the 256 bytes for byte-level BPE).
- Count the frequency of every adjacent symbol pair across the corpus.
- Merge the most frequent pair ; add to the vocabulary and record the merge rule.
- Repeat steps 2–3 until (e.g. 50,257 for GPT-2).
Encoding a new string: greedily apply the learned merge rules in learned order until no more apply. The number of tokens for a piece of text is what counts against a model's context window and per-token API pricing — so tokenization efficiency directly affects cost and how much you can fit in context.
1from collections import Counter
2
3def get_pairs(tokens: list[str]) -> Counter:
4 return Counter(zip(tokens, tokens[1:]))
5
6def train_bpe(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
7 tokens = list(" ".join(corpus)) # char-level start
8 merges: list[tuple[str, str]] = []
9 for _ in range(num_merges):
10 pairs = get_pairs(tokens)
11 if not pairs:
12 break
13 best = pairs.most_common(1)[0][0] # most frequent adjacent pair
14 merges.append(best)
15 merged, i = [], 0
16 while i < len(tokens): # apply the merge in one pass
17 if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == best:
18 merged.append(tokens[i] + tokens[i + 1])
19 i += 2
20 else:
21 merged.append(tokens[i])
22 i += 1
23 tokens = merged
24 return merges
25
26merges = train_bpe(["lower", "lowest", "newer"], num_merges=5)
27print(merges) # e.g. [('e','r'), ('l','o'), ('lo','w'), ...]- Conceptual: Why use subword tokenization instead of words or characters? (Words explode the vocabulary and hit OOV; characters make sequences too long. Subwords balance vocabulary size, sequence length, and never produce OOV.)
- Implementation: What does one BPE training step do? (Count adjacent symbol-pair frequencies and merge the most frequent pair into a new vocabulary symbol.)
- Applied: Why does token count, not word count, matter for context limits and cost? (The model and pricing operate on tokens; a rare word may be several tokens, so token count determines how much fits in context and what you pay.)
- Systems-level: What is byte-level BPE and what problem does it solve? (BPE over raw bytes instead of characters — it can represent any Unicode text, including emoji and rare scripts, with no unknown tokens.)
- Failure modes: How does BPE vs. WordPiece differ? (Both are subword; BPE merges by raw pair frequency, WordPiece merges by which pair most increases corpus likelihood under a language model.)
Without looking: describe the BPE training loop in three steps, explain why subword beats word- and character-level, and say why token count drives cost and context. Check against Stage 3.
This is one static walkthrough. A live session goes further.
Ask follow-ups at interview depth, get the math and code rendered as you go, and run a retrieval drill until it sticks — then come back to the thread anytime.
Related concepts
LLMs
Embeddings
How discrete tokens become dense vectors that capture meaning — static (word2vec) vs. contextual embeddings, cosine similarity, and negative sampling — with code.
Deep Learning
Transformer Architecture
The Transformer block from the ground up — self-attention plus a position-wise feed-forward network, residuals and LayerNorm, and the encoder/decoder configurations — with the math, PyTorch, and calibrated interview questions.
LLMs
Temperature, Top-k & Top-p Sampling
How LLMs choose the next token — temperature, top-k, top-p (nucleus), and beam search — the math, the tradeoffs, and a runnable PyTorch sampler.