Tokenization and Text Normalization: Why GPT Does Not See Words
From whitespace splitting to Byte Pair Encoding and SentencePiece, exploring how machines break text into processable units and why subword tokenization changed NLP forever.
Terminology
| Term | Definition |
|---|---|
| Token | The atomic unit a language model operates on: could be a word, subword fragment, or single character |
| Tokenization | The process of splitting raw text into a sequence of tokens for model consumption |
| Text Normalization | Preprocessing steps (lowercasing, removing accents, standardizing punctuation) applied before tokenization |
| Vocabulary (Vocab) | The fixed set of all tokens a model recognizes, typically sized between 30k and 100k entries |
| Byte Pair Encoding (BPE) | A subword tokenization algorithm that iteratively merges the most frequent adjacent pair of symbols until a target vocabulary size is reached |
| SentencePiece | A language-independent tokenizer that operates directly on raw Unicode text without requiring pre-tokenization by whitespace |
| OOV (Out-of-Vocabulary) | A word not present in the model's vocabulary, historically replaced by a special UNK token |
| Subword Tokenization | A strategy that splits rare words into smaller known fragments while keeping frequent words intact, eliminating OOV problems |
| Unigram Model | A tokenization method that starts with a large vocabulary and iteratively removes tokens whose loss has the least impact on the overall likelihood |
What & Why
Every language model, from a simple spam filter to GPT, needs to convert raw text into numbers before it can do anything. The question is: what counts as a unit? Splitting on whitespace gives you words, but then "unhappiness" and "happy" become unrelated symbols that share nothing. Character-level models see every letter but must assemble meaning from scratch over very long sequences. Subword tokenization sits in the sweet spot, breaking text into pieces that balance vocabulary size against semantic coverage.
The reason this matters: vocabulary size directly controls model size, memory usage, and the ability to handle unseen words. Word-level tokenizers need enormous vocabularies (100k+ entries) and still fail on typos, new slang, or morphologically rich languages like Finnish or Turkish. Character-level tokenizers have tiny vocabularies but produce very long sequences, making attention mechanisms expensive. BPE and its variants solved this by learning a vocabulary of common subword pieces from data, giving models the ability to represent any string while keeping sequence lengths manageable.
This is why GPT does not see "words." It sees subword tokens like ["un", "happiness"] or ["token", "ization"]. Understanding tokenization is the first step to understanding how any modern NLP system processes language.
How It Works
Word-Level Tokenization
The simplest approach: split on whitespace and punctuation. "The cat sat." becomes ["The", "cat", "sat", "."]. This works for English but fails for languages without clear word boundaries (Chinese, Japanese, Thai). It also creates a massive vocabulary and cannot handle misspellings or novel words.
Character-Level Tokenization
Split every character into its own token. "cat" becomes ["c", "a", "t"]. The vocabulary is tiny (under a hundred symbols for English text, a few hundred if common punctuation and symbols are covered), but sequences become very long. A 500-word document might produce 2500+ tokens, and the model must learn to compose meaning from individual characters.
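Both baseline strategies can be reproduced in a few lines. The regex below is one common convention (punctuation split into its own token), not a universal standard:

```python
import re

text = "The cat sat."

# Word-level: split on whitespace, keeping punctuation as separate tokens
words = re.findall(r"\w+|[^\w\s]", text)
print(words)       # ['The', 'cat', 'sat', '.']

# Character-level: every character becomes a token, including spaces
chars = list(text)
print(len(chars))  # 12 tokens for a 3-word sentence
```

Note the length blow-up already visible here: four word-level tokens versus twelve character-level tokens for the same sentence.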
Subword Tokenization (BPE)
BPE starts with a character-level vocabulary and iteratively merges the most frequent adjacent pair. After enough merges, common words like "the" remain as single tokens while rare words like "tokenization" get split into learned fragments.
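The first merge step can be sketched on a tiny toy corpus; the words and frequencies below are invented for illustration:

```python
from collections import Counter

# Toy corpus: each word is a tuple of characters plus an end-of-word marker,
# mapped to how often the word occurs
corpus = {
    ("l", "o", "w", "</w>"): 5,            # "low" occurs 5 times
    ("l", "o", "w", "e", "r", "</w>"): 2,  # "lower" occurs 2 times
}

# Count every adjacent symbol pair, weighted by word frequency
pairs = Counter()
for word, freq in corpus.items():
    for a, b in zip(word, word[1:]):
        pairs[(a, b)] += freq

# ('l','o') and ('o','w') are tied at 7; real implementations break
# ties deterministically (here, by insertion order)
print(pairs.most_common(2))
```

The winning pair is merged into a single symbol ("lo"), the counts are recomputed, and the loop repeats until the target number of merges is reached.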
SentencePiece
SentencePiece treats the input as a raw stream of Unicode characters, including whitespace, which it makes visible as a special marker character (▁, often described as a special underscore). This means it does not need language-specific pre-tokenization rules. It supports both BPE and unigram algorithms internally. Models like T5, LLaMA, and ALBERT use SentencePiece because it works identically across languages without a separate preprocessing pipeline.
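The whitespace-marker idea can be mimicked in a few lines. This is a rough illustration of the concept, not the actual SentencePiece library:

```python
MARKER = "\u2581"  # "▁", the visible whitespace symbol SentencePiece uses

def mark_whitespace(text: str) -> str:
    # Make spaces visible by replacing them with the marker and prefixing
    # the first word, so tokenization never depends on external whitespace rules
    return MARKER + text.replace(" ", MARKER)

def restore_whitespace(marked: str) -> str:
    # Detokenization is a lossless inverse: swap markers back to spaces
    return marked.replace(MARKER, " ").lstrip()

s = "new york"
print(mark_whitespace(s))                       # ▁new▁york
assert restore_whitespace(mark_whitespace(s)) == s
```

Because whitespace is an ordinary symbol, joining tokens back into the original string is a trivial, lossless operation in any language.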
WordPiece
WordPiece (used by BERT) is similar to BPE but selects merges based on the pair that maximizes the likelihood of the training data rather than raw frequency. It prefixes continuation fragments with "##", so "tokenization" might become ["token", "##ization"]. The difference from BPE is subtle but affects which merges get prioritized.
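At inference time, WordPiece segments a word by greedy longest-match-first lookup against the vocabulary. A minimal sketch, with a hypothetical toy vocabulary:

```python
def wordpiece_encode(word, vocab):
    """Greedy longest-match-first segmentation (BERT-style inference).
    Continuation pieces carry the '##' prefix; returns ['[UNK]'] if no
    vocabulary piece matches at some position."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub  # mark as a continuation fragment
            if sub in vocab:
                piece = sub
                break
            end -= 1  # shrink the candidate until it is in the vocabulary
        if piece is None:
            return ["[UNK]"]
        tokens.append(piece)
        start = end
    return tokens

vocab = {"token", "##ization", "un", "##happi", "##ness"}
print(wordpiece_encode("tokenization", vocab))  # ['token', '##ization']
print(wordpiece_encode("unhappiness", vocab))   # ['un', '##happi', '##ness']
```

Note that this greedy matching only covers encoding; the harder part, choosing which pieces belong in the vocabulary, is what the likelihood-based training criterion decides.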
Complexity Analysis
| Operation | Time | Space | Notes |
|---|---|---|---|
| BPE training (learning merges) | $O(N \cdot V)$ | $O(V)$ | $N$ = corpus size, $V$ = target vocab size (number of merge iterations) |
| BPE encoding (applying merges) | $O(n \cdot m)$ | $O(n)$ | $n$ = input length, $m$ = number of merge rules |
| Unigram tokenization (Viterbi) | $O(n \cdot k)$ | $O(n)$ | $k$ = max token length, finds optimal segmentation via dynamic programming |
| Word-level tokenization | $O(n)$ | $O(n)$ | Simple split on whitespace/punctuation |
| Character-level tokenization | $O(n)$ | $O(n)$ | Trivial, but output length equals input length |
The key trade-off is between vocabulary size $V$ and average sequence length $L$. Larger $V$ means shorter sequences (fewer tokens per sentence) but a bigger embedding matrix of size $V \times d$, where $d$ is the embedding dimension. The total embedding memory is:

$$\text{memory} = V \times d \times \text{bytes per parameter}$$

For GPT-2 with $V = 50{,}257$ and $d = 768$: $50{,}257 \times 768 \times 4 \text{ bytes} \approx 154$ MB, or roughly 150 MB in float32. Doubling the vocabulary doubles this cost.
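The GPT-2 figure is a quick one-liner to check, assuming 4-byte float32 parameters:

```python
V, d = 50_257, 768           # GPT-2 vocabulary size and embedding dimension
bytes_per_param = 4          # float32
mem_mb = V * d * bytes_per_param / 2**20

print(f"Embedding matrix: {mem_mb:.0f} MiB")  # ~147 MiB (~154 MB decimal)
```

The same arithmetic shows why 100k+ word-level vocabularies are costly: at the same $d$, doubling $V$ doubles both the embedding matrix and the output softmax layer that typically mirrors it.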
Implementation
ALGORITHM TrainBPE(corpus, numMerges)
INPUT: corpus: list of strings, numMerges: integer (target number of merges)
OUTPUT: mergeRules: ordered list of (pair -> merged symbol)
BEGIN
// Initialize: split every word into characters
vocab <- empty frequency map
FOR EACH word IN corpus DO
chars <- SPLIT word INTO individual characters
APPEND end-of-word marker to chars
vocab[chars] <- vocab[chars] + frequency of word
END FOR
mergeRules <- empty list
FOR i FROM 1 TO numMerges DO
// Count all adjacent pairs
pairCounts <- empty map
FOR EACH (tokenSequence, freq) IN vocab DO
FOR j FROM 0 TO LENGTH(tokenSequence) - 2 DO
pair <- (tokenSequence[j], tokenSequence[j+1])
pairCounts[pair] <- pairCounts[pair] + freq
END FOR
END FOR
IF pairCounts is empty THEN BREAK
bestPair <- pair with maximum count in pairCounts
mergedSymbol <- CONCATENATE bestPair.first + bestPair.second
// Apply merge to all sequences in vocab
newVocab <- empty map
FOR EACH (tokenSequence, freq) IN vocab DO
newSeq <- REPLACE all occurrences of bestPair in tokenSequence with mergedSymbol
newVocab[newSeq] <- freq
END FOR
vocab <- newVocab
APPEND (bestPair -> mergedSymbol) TO mergeRules
END FOR
RETURN mergeRules
END
ALGORITHM ApplyBPE(text, mergeRules)
INPUT: text: string, mergeRules: ordered list of merge rules
OUTPUT: tokens: list of subword strings
BEGIN
tokens <- SPLIT text INTO individual characters
APPEND end-of-word marker after each word boundary
FOR EACH (pair -> merged) IN mergeRules DO
i <- 0
WHILE i < LENGTH(tokens) - 1 DO
IF tokens[i] = pair.first AND tokens[i+1] = pair.second THEN
REPLACE tokens[i] WITH merged
REMOVE tokens[i+1]
ELSE
i <- i + 1
END IF
END WHILE
END FOR
RETURN tokens
END
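The two procedures above translate directly into Python. This is a minimal reference sketch, keeping the quadratic encoding loop of the pseudocode; production tokenizers add regex pre-tokenization, caching, and priority queues:

```python
from collections import Counter

def train_bpe(word_freqs, num_merges):
    """Learn BPE merge rules from a word->frequency map (TrainBPE above)."""
    # Represent each word as a tuple of symbols ending with a marker
    vocab = {tuple(w) + ("</w>",): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count all adjacent pairs, weighted by word frequency
        pairs = Counter()
        for seq, freq in vocab.items():
            for pair in zip(seq, seq[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merged = best[0] + best[1]
        # Apply the merge to every sequence in the vocabulary
        new_vocab = {}
        for seq, freq in vocab.items():
            out, i = [], 0
            while i < len(seq):
                if i < len(seq) - 1 and (seq[i], seq[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
        merges.append(best)
    return merges

def apply_bpe(word, merges):
    """Tokenize one word by replaying learned merges in order (ApplyBPE above)."""
    seq = list(word) + ["</w>"]
    for a, b in merges:
        i = 0
        while i < len(seq) - 1:
            if seq[i] == a and seq[i + 1] == b:
                seq[i:i + 2] = [a + b]  # merge in place; recheck position i
            else:
                i += 1
    return seq

rules = train_bpe({"low": 5, "lower": 2, "lowest": 1}, num_merges=4)
print(apply_bpe("low", rules))     # ['low</w>']  -- frequent word stays whole
print(apply_bpe("lowest", rules))  # ['lowe', 's', 't', '</w>']
```

As the example shows, the frequent word "low" survives as a single token while the rarer "lowest" falls apart into learned fragments, exactly the behavior described in the BPE section above.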
Real-World Applications
- GPT family (OpenAI): Uses byte-level BPE with a vocabulary of ~50k tokens, operating on UTF-8 bytes so it can tokenize any language or even binary data
- BERT (Google): Uses WordPiece with a 30k vocabulary, prefixing subword continuations with "##" to distinguish word starts from fragments
- T5 and LLaMA: Use SentencePiece with unigram or BPE mode, treating whitespace as a regular character for true language independence
- Search engines: Apply text normalization (stemming, lowercasing, accent removal) before indexing to match queries like "running" to documents containing "run" or "runs"
- Machine translation: Subword tokenization lets models handle morphologically rich languages (Finnish, Turkish, German compounds) without exploding vocabulary sizes
- Code models (Codex, StarCoder): Tokenizers trained on source code learn to keep common keywords and operators as single tokens while splitting rare identifiers into subwords
Key Takeaways
- Tokenization is the first and most consequential preprocessing step in any NLP pipeline: it determines what the model can and cannot represent
- Word-level tokenization creates OOV problems; character-level creates long sequences; subword methods like BPE balance both
- BPE learns merge rules from data by iteratively combining the most frequent adjacent pair, building a vocabulary bottom-up from characters
- SentencePiece removes the need for language-specific preprocessing by operating directly on raw Unicode text, with whitespace treated as an ordinary symbol
- Vocabulary size is a direct trade-off: larger vocabularies mean shorter sequences but bigger embedding matrices and more memory