Why use subword tokenization instead of words or characters?

Word vocabularies explode in size and hit out-of-vocabulary words; character sequences are too long. Subword units balance vocabulary size, sequence length, and never produce OOV tokens.

How does Byte-Pair Encoding (BPE) work?

It starts from characters (or bytes) and repeatedly merges the most frequent adjacent pair into a new vocabulary symbol until it reaches a target vocabulary size.

What is byte-level BPE?

BPE applied over raw bytes instead of characters, so it can represent any Unicode text — emoji, rare scripts — with no unknown tokens. GPT models use it.

What is the difference between BPE and WordPiece?

Both are subword methods; BPE merges by raw pair frequency, while WordPiece merges the pair that most increases the corpus likelihood under a language model.

Tokenization & BPE Explained (ML Interview)

How text becomes tokens — Byte-Pair Encoding, why subword beats word- and character-level, the training algorithm, and why token count drives context and cost — with code.

1Big Picture

Before a language model sees text, the text must become a sequence of integer tokens. Tokenization decides what those units are. Word-level vocabularies explode in size and choke on rare or unseen words (out-of-vocabulary); character-level sequences are tiny in vocabulary but very long. Byte-Pair Encoding (BPE) is the subword compromise that won: it learns a vocabulary of frequent character chunks, so common words become single tokens while rare words split into meaningful pieces — no OOV, manageable vocabulary, reasonable sequence length.

Every modern LLM tokenizes with a BPE variant (GPT uses byte-level BPE; BERT uses the closely related WordPiece). The frame to hold: BPE starts from characters and greedily merges the most frequent adjacent pair, over and over, until it hits a target vocabulary size. Interviewers care that you understand why subword, the vocab-size/sequence-length tradeoff, and that token count — not word count — drives context limits and API cost.

2Intuition + Visual

Training: look at a big corpus as sequences of characters. Find the most frequent adjacent pair (say t + h → th), merge it into a new symbol, and add it to the vocabulary. Repeat. Frequent sequences like ing, tion, or whole common words get absorbed into single tokens; rare words stay split into subwords.

Encoding new text just replays the learned merges in the order they were learned. The result: "tokenization" might be one token, while "antidisestablishmentarianism" becomes several subwords — but nothing is ever unrepresentable.

3The Math·The Math / Algorithm

BPE is an algorithm rather than a closed-form equation. Training to a target vocabulary size $V$ :

Initialize the vocabulary with all base symbols (characters, or the 256 bytes for byte-level BPE).
Count the frequency of every adjacent symbol pair across the corpus.
Merge the most frequent pair $(a, b) \to ab$ ; add $ab$ to the vocabulary and record the merge rule.
Repeat steps 2–3 until $|\text{vocab}| = V$ (e.g. 50,257 for GPT-2).

Encoding a new string: greedily apply the learned merge rules in learned order until no more apply. The number of tokens for a piece of text is what counts against a model's context window and per-token API pricing — so tokenization efficiency directly affects cost and how much you can fit in context.

4Implementation

python

1from collections import Counter
2
3def get_pairs(tokens: list[str]) -> Counter:
4    return Counter(zip(tokens, tokens[1:]))
5
6def train_bpe(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
7    tokens = list(" ".join(corpus))          # char-level start
8    merges: list[tuple[str, str]] = []
9    for _ in range(num_merges):
10        pairs = get_pairs(tokens)
11        if not pairs:
12            break
13        best = pairs.most_common(1)[0][0]    # most frequent adjacent pair
14        merges.append(best)
15        merged, i = [], 0
16        while i < len(tokens):               # apply the merge in one pass
17            if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == best:
18                merged.append(tokens[i] + tokens[i + 1])
19                i += 2
20            else:
21                merged.append(tokens[i])
22                i += 1
23        tokens = merged
24    return merges
25
26merges = train_bpe(["lower", "lowest", "newer"], num_merges=5)
27print(merges)   # e.g. [('e','r'), ('l','o'), ('lo','w'), ...]

5Interview Questions

Conceptual: Why use subword tokenization instead of words or characters? (Words explode the vocabulary and hit OOV; characters make sequences too long. Subwords balance vocabulary size, sequence length, and never produce OOV.)
Implementation: What does one BPE training step do? (Count adjacent symbol-pair frequencies and merge the most frequent pair into a new vocabulary symbol.)
Applied: Why does token count, not word count, matter for context limits and cost? (The model and pricing operate on tokens; a rare word may be several tokens, so token count determines how much fits in context and what you pay.)
Systems-level: What is byte-level BPE and what problem does it solve? (BPE over raw bytes instead of characters — it can represent any Unicode text, including emoji and rare scripts, with no unknown tokens.)
Failure modes: How does BPE vs. WordPiece differ? (Both are subword; BPE merges by raw pair frequency, WordPiece merges by which pair most increases corpus likelihood under a language model.)

6Retrieval Check

Without looking: describe the BPE training loop in three steps, explain why subword beats word- and character-level, and say why token count drives cost and context. Check against Stage 3.

Tokenization & BPE

Related concepts