Skip to content

LLMs

Tokenization & BPE

How text becomes tokens — Byte-Pair Encoding, why subword beats word- and character-level, the training algorithm, and why token count drives context and cost — with code.

7 min readReviewed May 2026

1Big Picture

Before a language model sees text, the text must become a sequence of integer tokens. Tokenization decides what those units are. Word-level vocabularies explode in size and choke on rare or unseen words (out-of-vocabulary); character-level sequences are tiny in vocabulary but very long. Byte-Pair Encoding (BPE) is the subword compromise that won: it learns a vocabulary of frequent character chunks, so common words become single tokens while rare words split into meaningful pieces — no OOV, manageable vocabulary, reasonable sequence length.

Every modern LLM tokenizes with a BPE variant (GPT uses byte-level BPE; BERT uses the closely related WordPiece). The frame to hold: BPE starts from characters and greedily merges the most frequent adjacent pair, over and over, until it hits a target vocabulary size. Interviewers care that you understand why subword, the vocab-size/sequence-length tradeoff, and that token count — not word count — drives context limits and API cost.

2Intuition + Visual

Training: look at a big corpus as sequences of characters. Find the most frequent adjacent pair (say t + hth), merge it into a new symbol, and add it to the vocabulary. Repeat. Frequent sequences like ing, tion, or whole common words get absorbed into single tokens; rare words stay split into subwords.

flowchart LR
    A["l o w e r"] --> B["merge 'e r' -> 'er'"]
    B --> C["l o w er"]
    C --> D["merge 'l o' -> 'lo'"]
    D --> E["lo w er"]

Encoding new text just replays the learned merges in the order they were learned. The result: "tokenization" might be one token, while "antidisestablishmentarianism" becomes several subwords — but nothing is ever unrepresentable.

3The Math·The Math / Algorithm

BPE is an algorithm rather than a closed-form equation. Training to a target vocabulary size VV:

  1. Initialize the vocabulary with all base symbols (characters, or the 256 bytes for byte-level BPE).
  2. Count the frequency of every adjacent symbol pair across the corpus.
  3. Merge the most frequent pair (a,b)ab(a, b) \to ab; add abab to the vocabulary and record the merge rule.
  4. Repeat steps 2–3 until vocab=V|\text{vocab}| = V (e.g. 50,257 for GPT-2).

Encoding a new string: greedily apply the learned merge rules in learned order until no more apply. The number of tokens for a piece of text is what counts against a model's context window and per-token API pricing — so tokenization efficiency directly affects cost and how much you can fit in context.

4Implementation
python
1from collections import Counter
2
3def get_pairs(tokens: list[str]) -> Counter:
4    return Counter(zip(tokens, tokens[1:]))
5
6def train_bpe(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
7    tokens = list(" ".join(corpus))          # char-level start
8    merges: list[tuple[str, str]] = []
9    for _ in range(num_merges):
10        pairs = get_pairs(tokens)
11        if not pairs:
12            break
13        best = pairs.most_common(1)[0][0]    # most frequent adjacent pair
14        merges.append(best)
15        merged, i = [], 0
16        while i < len(tokens):               # apply the merge in one pass
17            if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == best:
18                merged.append(tokens[i] + tokens[i + 1])
19                i += 2
20            else:
21                merged.append(tokens[i])
22                i += 1
23        tokens = merged
24    return merges
25
26merges = train_bpe(["lower", "lowest", "newer"], num_merges=5)
27print(merges)   # e.g. [('e','r'), ('l','o'), ('lo','w'), ...]
5Interview Questions
  1. Conceptual: Why use subword tokenization instead of words or characters? (Words explode the vocabulary and hit OOV; characters make sequences too long. Subwords balance vocabulary size, sequence length, and never produce OOV.)
  2. Implementation: What does one BPE training step do? (Count adjacent symbol-pair frequencies and merge the most frequent pair into a new vocabulary symbol.)
  3. Applied: Why does token count, not word count, matter for context limits and cost? (The model and pricing operate on tokens; a rare word may be several tokens, so token count determines how much fits in context and what you pay.)
  4. Systems-level: What is byte-level BPE and what problem does it solve? (BPE over raw bytes instead of characters — it can represent any Unicode text, including emoji and rare scripts, with no unknown tokens.)
  5. Failure modes: How does BPE vs. WordPiece differ? (Both are subword; BPE merges by raw pair frequency, WordPiece merges by which pair most increases corpus likelihood under a language model.)
6Retrieval Check

Without looking: describe the BPE training loop in three steps, explain why subword beats word- and character-level, and say why token count drives cost and context. Check against Stage 3.

This is one static walkthrough. A live session goes further.

Ask follow-ups at interview depth, get the math and code rendered as you go, and run a retrieval drill until it sticks — then come back to the thread anytime.

Related concepts