Tokenization and Text Normalization: Why GPT Does Not See Words
From whitespace splitting to Byte Pair Encoding and SentencePiece, exploring how machines break text into processable units and why subword tokenization changed NLP forever.
Terminology
| Term | Definition |
|---|---|
| Token | The atomic unit a language model operates on: could be a word, subword fragment, or single character |
| Tokenization | The process of splitting raw text into a sequence of tokens for model consumption |
| Text Normalization | Preprocessing steps (lowercasing, removing accents, standardizing punctuation) applied before tokenization |
| Vocabulary (Vocab) | The fixed set of all tokens a model recognizes, typically sized between 30k and 100k entries |
| Byte Pair Encoding (BPE) | A subword tokenization algorithm that iteratively merges the most frequent adjacent pair of symbols until a target vocabulary size is reached |
| SentencePiece | A language-independent tokenizer that operates directly on raw Unicode text without requiring pre-tokenization by whitespace |
| OOV (Out-of-Vocabulary) | A word not present in the model's vocabulary, historically replaced by a special UNK token |
| Subword Tokenization | A strategy that splits rare words into smaller known fragments while keeping frequent words intact, eliminating OOV problems |
| Unigram Model | A tokenization method that starts with a large vocabulary and iteratively removes tokens whose loss has the least impact on the overall likelihood |
What & Why
Every language model, from a simple spam filter to GPT, needs to convert raw text into numbers before it can do anything. The question is: what counts as a unit? Splitting on whitespace gives you words, but then "unhappiness" and "happy" become unrelated symbols that share nothing. Character-level models see every letter but must assemble meaning from scratch over very long sequences. Subword tokenization sits in the sweet spot, breaking text into pieces that balance vocabulary size against semantic coverage.
The reason this matters: vocabulary size directly controls model size, memory usage, and the ability to handle unseen words. Word-level tokenizers need enormous vocabularies (100k+ entries) and still fail on typos, new slang, or morphologically rich languages like Finnish or Turkish. Character-level tokenizers have tiny vocabularies but produce very long sequences, making attention mechanisms expensive. BPE and its variants solved this by learning a vocabulary of common subword pieces from data, giving models the ability to represent any string while keeping sequence lengths manageable.
This is why GPT does not see "words." It sees subword tokens like ["un", "happiness"] or ["token", "ization"]. Understanding tokenization is the first step to understanding how any modern NLP system processes language.
How It Works
Word-Level Tokenization
The simplest approach: split on whitespace and punctuation. "The cat sat." becomes ["The", "cat", "sat", "."]. This works for English but fails for languages without clear word boundaries (Chinese, Japanese, Thai). It also creates a massive vocabulary and cannot handle misspellings or novel words.
Character-Level Tokenization
Split every character into its own token. "cat" becomes ["c", "a", "t"]. The vocabulary is tiny (under a hundred symbols for English text, a few hundred if common punctuation and symbols are covered), but sequences become very long. A 500-word document might produce 2500+ tokens, and the model must learn to compose meaning from individual characters.
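Both baseline strategies can be reproduced in a few lines. The regex below is one common convention (punctuation split into its own token), not a universal standard:

```python
import re

text = "The cat sat."

# Word-level: split on whitespace, keeping punctuation as separate tokens
words = re.findall(r"\w+|[^\w\s]", text)
print(words)       # ['The', 'cat', 'sat', '.']

# Character-level: every character becomes a token, including spaces
chars = list(text)
print(len(chars))  # 12 tokens for a 3-word sentence
```

Note the length blow-up already visible here: four word-level tokens versus twelve character-level tokens for the same sentence.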
Subword Tokenization (BPE)
BPE starts with a character-level vocabulary and iteratively merges the most frequent adjacent pair. After enough merges, common words like "the" remain as single tokens while rare words like "tokenization" get split into learned fragments.
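The first merge step can be sketched on a tiny toy corpus; the words and frequencies below are invented for illustration:

```python
from collections import Counter

# Toy corpus: each word is a tuple of characters plus an end-of-word marker,
# mapped to how often the word occurs
corpus = {
    ("l", "o", "w", "</w>"): 5,            # "low" occurs 5 times
    ("l", "o", "w", "e", "r", "</w>"): 2,  # "lower" occurs 2 times
}

# Count every adjacent symbol pair, weighted by word frequency
pairs = Counter()
for word, freq in corpus.items():
    for a, b in zip(word, word[1:]):
        pairs[(a, b)] += freq

# ('l','o') and ('o','w') are tied at 7; real implementations break
# ties deterministically (here, by insertion order)
print(pairs.most_common(2))
```

The winning pair is merged into a single symbol ("lo"), the counts are recomputed, and the loop repeats until the target number of merges is reached.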
SentencePiece
SentencePiece treats the input as a raw stream of Unicode characters, including whitespace, which it makes visible as a special marker character (▁, often described as a special underscore). This means it does not need language-specific pre-tokenization rules. It supports both BPE and unigram algorithms internally. Models like T5, LLaMA, and ALBERT use SentencePiece because it works identically across languages without a separate preprocessing pipeline.
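The whitespace-marker idea can be mimicked in a few lines. This is a rough illustration of the concept, not the actual SentencePiece library:

```python
MARKER = "\u2581"  # "▁", the visible whitespace symbol SentencePiece uses

def mark_whitespace(text: str) -> str:
    # Make spaces visible by replacing them with the marker and prefixing
    # the first word, so tokenization never depends on external whitespace rules
    return MARKER + text.replace(" ", MARKER)

def restore_whitespace(marked: str) -> str:
    # Detokenization is a lossless inverse: swap markers back to spaces
    return marked.replace(MARKER, " ").lstrip()

s = "new york"
print(mark_whitespace(s))                       # ▁new▁york
assert restore_whitespace(mark_whitespace(s)) == s
```

Because whitespace is an ordinary symbol, joining tokens back into the original string is a trivial, lossless operation in any language.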
WordPiece
WordPiece (used by BERT) is similar to BPE but selects merges based on the pair that maximizes the likelihood of the training data rather than raw frequency. It prefixes continuation fragments with "##", so "tokenization" might become ["token", "##ization"]. The difference from BPE is subtle but affects which merges get prioritized.
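At inference time, WordPiece segments a word by greedy longest-match-first lookup against the vocabulary. A minimal sketch, with a hypothetical toy vocabulary:

```python
def wordpiece_encode(word, vocab):
    """Greedy longest-match-first segmentation (BERT-style inference).
    Continuation pieces carry the '##' prefix; returns ['[UNK]'] if no
    vocabulary piece matches at some position."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub  # mark as a continuation fragment
            if sub in vocab:
                piece = sub
                break
            end -= 1  # shrink the candidate until it is in the vocabulary
        if piece is None:
            return ["[UNK]"]
        tokens.append(piece)
        start = end
    return tokens

vocab = {"token", "##ization", "un", "##happi", "##ness"}
print(wordpiece_encode("tokenization", vocab))  # ['token', '##ization']
print(wordpiece_encode("unhappiness", vocab))   # ['un', '##happi', '##ness']
```

Note that this greedy matching only covers encoding; the harder part, choosing which pieces belong in the vocabulary, is what the likelihood-based training criterion decides.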
Complexity Analysis
| Operation | Time | Space | Notes |
|---|---|---|---|
| BPE training (learning merges) | $O(N \cdot V)$ | $O(V)$ | $N$ = corpus size, $V$ = target vocab size (number of merge iterations) |
| BPE encoding (applying merges) | $O(n \cdot m)$ | $O(n)$ | $n$ = input length, $m$ = number of merge rules |
| Unigram tokenization (Viterbi) | $O(n \cdot k)$ | $O(n)$ | $k$ = max token length, finds optimal segmentation via dynamic programming |
| Word-level tokenization | $O(n)$ | $O(n)$ | Simple split on whitespace/punctuation |
| Character-level tokenization | $O(n)$ | $O(n)$ | Trivial, but output length equals input length |
The key trade-off is between vocabulary size $V$ and average sequence length $L$. Larger $V$ means shorter sequences (fewer tokens per sentence) but a bigger embedding matrix of size $V \times d$, where $d$ is the embedding dimension. The total embedding memory is:

$$\text{memory} = V \times d \times \text{bytes per parameter}$$

For GPT-2 with $V = 50{,}257$ and $d = 768$: $50{,}257 \times 768 \times 4 \text{ bytes} \approx 154$ MB, or roughly 150 MB in float32. Doubling the vocabulary doubles this cost.
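The GPT-2 figure is a quick one-liner to check, assuming 4-byte float32 parameters:

```python
V, d = 50_257, 768           # GPT-2 vocabulary size and embedding dimension
bytes_per_param = 4          # float32
mem_mb = V * d * bytes_per_param / 2**20

print(f"Embedding matrix: {mem_mb:.0f} MiB")  # ~147 MiB (~154 MB decimal)
```

The same arithmetic shows why 100k+ word-level vocabularies are costly: at the same $d$, doubling $V$ doubles both the embedding matrix and the output softmax layer that typically mirrors it.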
Implementation
ALGORITHM TrainBPE(corpus, numMerges)
INPUT: corpus: list of strings, numMerges: integer (target number of merges)
OUTPUT: mergeRules: ordered list of (pair -> merged symbol)
BEGIN
// Initialize: split every word into characters
vocab <- empty frequency map
FOR EACH word IN corpus DO
chars <- SPLIT word INTO individual characters
APPEND end-of-word marker to chars
vocab[chars] <- vocab[chars] + frequency of word
END FOR
mergeRules <- empty list
FOR i FROM 1 TO numMerges DO
// Count all adjacent pairs
pairCounts <- empty map
FOR EACH (tokenSequence, freq) IN vocab DO
FOR j FROM 0 TO LENGTH(tokenSequence) - 2 DO
pair <- (tokenSequence[j], tokenSequence[j+1])
pairCounts[pair] <- pairCounts[pair] + freq
END FOR
END FOR
IF pairCounts is empty THEN BREAK
bestPair <- pair with maximum count in pairCounts
mergedSymbol <- CONCATENATE bestPair.first + bestPair.second
// Apply merge to all sequences in vocab
newVocab <- empty map
FOR EACH (tokenSequence, freq) IN vocab DO
newSeq <- REPLACE all occurrences of bestPair in tokenSequence with mergedSymbol
newVocab[newSeq] <- freq
END FOR
vocab <- newVocab
APPEND (bestPair -> mergedSymbol) TO mergeRules
END FOR
RETURN mergeRules
END
ALGORITHM ApplyBPE(text, mergeRules)
INPUT: text: string, mergeRules: ordered list of merge rules
OUTPUT: tokens: list of subword strings
BEGIN
tokens <- SPLIT text INTO individual characters
APPEND end-of-word marker after each word boundary
FOR EACH (pair -> merged) IN mergeRules DO
i <- 0
WHILE i < LENGTH(tokens) - 1 DO
IF tokens[i] = pair.first AND tokens[i+1] = pair.second THEN
REPLACE tokens[i] WITH merged
REMOVE tokens[i+1]
ELSE
i <- i + 1
END IF
END WHILE
END FOR
RETURN tokens
END
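The two procedures above translate directly into Python. This is a minimal reference sketch, keeping the quadratic encoding loop of the pseudocode; production tokenizers add regex pre-tokenization, caching, and priority queues:

```python
from collections import Counter

def train_bpe(word_freqs, num_merges):
    """Learn BPE merge rules from a word->frequency map (TrainBPE above)."""
    # Represent each word as a tuple of symbols ending with a marker
    vocab = {tuple(w) + ("</w>",): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count all adjacent pairs, weighted by word frequency
        pairs = Counter()
        for seq, freq in vocab.items():
            for pair in zip(seq, seq[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merged = best[0] + best[1]
        # Apply the merge to every sequence in the vocabulary
        new_vocab = {}
        for seq, freq in vocab.items():
            out, i = [], 0
            while i < len(seq):
                if i < len(seq) - 1 and (seq[i], seq[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
        merges.append(best)
    return merges

def apply_bpe(word, merges):
    """Tokenize one word by replaying learned merges in order (ApplyBPE above)."""
    seq = list(word) + ["</w>"]
    for a, b in merges:
        i = 0
        while i < len(seq) - 1:
            if seq[i] == a and seq[i + 1] == b:
                seq[i:i + 2] = [a + b]  # merge in place; recheck position i
            else:
                i += 1
    return seq

rules = train_bpe({"low": 5, "lower": 2, "lowest": 1}, num_merges=4)
print(apply_bpe("low", rules))     # ['low</w>']  -- frequent word stays whole
print(apply_bpe("lowest", rules))  # ['lowe', 's', 't', '</w>']
```

As the example shows, the frequent word "low" survives as a single token while the rarer "lowest" falls apart into learned fragments, exactly the behavior described in the BPE section above.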
Real-World Applications
- GPT family (OpenAI): Uses byte-level BPE with a vocabulary of ~50k tokens, operating on UTF-8 bytes so it can tokenize any language or even binary data
- BERT (Google): Uses WordPiece with a 30k vocabulary, prefixing subword continuations with "##" to distinguish word starts from fragments
- T5 and LLaMA: Use SentencePiece with unigram or BPE mode, treating whitespace as a regular character for true language independence
- Search engines: Apply text normalization (stemming, lowercasing, accent removal) before indexing to match queries like "running" to documents containing "run" or "runs"
- Machine translation: Subword tokenization lets models handle morphologically rich languages (Finnish, Turkish, German compounds) without exploding vocabulary sizes
- Code models (Codex, StarCoder): Tokenizers trained on source code learn to keep common keywords and operators as single tokens while splitting rare identifiers into subwords
Key Takeaways
- Tokenization is the first and most consequential preprocessing step in any NLP pipeline: it determines what the model can and cannot represent
- Word-level tokenization creates OOV problems; character-level creates long sequences; subword methods like BPE balance both
- BPE learns merge rules from data by iteratively combining the most frequent adjacent pair, building a vocabulary bottom-up from characters
- SentencePiece removes the need for language-specific preprocessing by operating directly on raw Unicode text, with whitespace treated as an ordinary symbol
- Vocabulary size is a direct trade-off: larger vocabularies mean shorter sequences but bigger embedding matrices and more memory