
N-Grams and Language Models: Predicting the Next Word with Counting

From unigrams to trigrams, exploring how the Markov assumption turns word prediction into a counting problem, and why smoothing techniques like Kneser-Ney keep probabilities from collapsing to zero.

2025-10-09
Computational Linguistics · n-grams · language-models · nlp

Terminology

N-Gram: A contiguous sequence of $n$ tokens from a text; unigram ($n=1$), bigram ($n=2$), trigram ($n=3$).
Language Model (LM): A probability distribution over sequences of tokens, assigning a likelihood to any given string of text.
Markov Assumption: The simplifying assumption that the probability of a word depends only on the previous $n-1$ words, not the entire history.
Perplexity: A measure of how well a language model predicts a test set; lower perplexity means better prediction. Equals $2^{H}$, where $H$ is the cross-entropy.
Maximum Likelihood Estimation (MLE): Estimating n-gram probabilities by dividing observed counts: $P(w_i | w_{i-1}) = \frac{C(w_{i-1}, w_i)}{C(w_{i-1})}$.
Laplace (Add-One) Smoothing: Adding 1 to every n-gram count so that no probability is zero: $P(w_i | w_{i-1}) = \frac{C(w_{i-1}, w_i) + 1}{C(w_{i-1}) + V}$.
Kneser-Ney Smoothing: An advanced smoothing method that discounts observed counts by a fixed $d$ and redistributes the freed mass based on how many different contexts a word appears in.
Backoff: A strategy in which the model falls back to a shorter n-gram (e.g., bigram to unigram) when the longer n-gram has zero counts.
Cross-Entropy: The average number of bits needed to encode test data using the model's probability distribution: $H = -\frac{1}{N}\sum \log_2 P(w_i | \text{context})$.

What & Why

Before neural networks took over NLP, the dominant approach to language modeling was counting. You observe a large corpus of text, count how often word sequences appear, and use those counts to estimate the probability of the next word. A bigram model asks: given the previous word, what comes next? A trigram model looks at the previous two words.

This works because of the Markov assumption: the probability of a word depends only on a small, fixed window of preceding words. The assumption is wrong (language has long-range dependencies), but it was useful enough to power spell checkers, speech recognition, and machine translation for decades.

The core challenge is sparsity. Most possible n-grams never appear in any training corpus. If you have never seen "the walrus danced," your model assigns it probability zero, which breaks everything. Smoothing techniques redistribute probability mass from observed n-grams to unseen ones, and the choice of smoothing method has a dramatic effect on model quality.

How It Works

The Chain Rule and Markov Simplification

The probability of a sentence $w_1, w_2, \ldots, w_n$ is:

$P(w_1 w_2 \ldots w_n) = \prod_{i=1}^{n} P(w_i | w_1 \ldots w_{i-1})$

The Markov assumption truncates the conditioning context. For a bigram model:

$P(w_i | w_1 \ldots w_{i-1}) \approx P(w_i | w_{i-1})$

For a trigram model:

$P(w_i | w_1 \ldots w_{i-1}) \approx P(w_i | w_{i-2}, w_{i-1})$
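Under the bigram assumption, scoring a sentence reduces to multiplying one conditional probability per adjacent word pair. A minimal sketch, using toy probabilities invented purely for illustration (not estimates from any real corpus):

```python
from math import prod

# Hypothetical bigram probabilities, invented for illustration only.
bigram_p = {
    ("<s>", "the"): 0.30,
    ("the", "cat"): 0.02,
    ("cat", "sat"): 0.045,
    ("sat", "</s>"): 0.10,
}

def sentence_prob(tokens, bigram_p):
    """Score a sentence as the product of its bigram probabilities."""
    padded = ["<s>"] + tokens + ["</s>"]
    return prod(bigram_p[(a, b)] for a, b in zip(padded, padded[1:]))

p = sentence_prob(["the", "cat", "sat"], bigram_p)
# 0.30 * 0.02 * 0.045 * 0.10 = 2.7e-05
```

Note how quickly the product shrinks even for a short sentence; real implementations work in log space for exactly this reason.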

Counting and MLE

Bigram probabilities are estimated by counting:

$P(w_i | w_{i-1}) = \frac{C(w_{i-1}, w_i)}{C(w_{i-1})}$

If "the cat" appears 100 times and "the" appears 5000 times, then $P(\text{cat} | \text{the}) = 100/5000 = 0.02$.
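The counting itself is a few lines of code. A sketch over a toy three-sentence corpus (the counts here stand in for what a large corpus would provide):

```python
from collections import Counter

# Toy corpus; each sentence is a list of tokens.
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]

unigram = Counter()
bigram = Counter()
for sent in corpus:
    tokens = ["<s>"] + sent + ["</s>"]
    unigram.update(tokens)                 # C(w)
    bigram.update(zip(tokens, tokens[1:])) # C(w_{i-1}, w_i)

def mle(prev, word):
    """MLE estimate: P(word | prev) = C(prev, word) / C(prev)."""
    return bigram[(prev, word)] / unigram[prev]

# "the cat" occurs 2 times, "the" occurs 3 times -> P(cat | the) = 2/3
```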

Bigram Probability Table (partial)

Context  Next Word  P(next|ctx)
"the"    "cat"      0.020
"the"    "dog"      0.015
"cat"    "sat"      0.045
"cat"    "jumped"   0.000

A zero probability means the model breaks; smoothing is required.

Smoothing Techniques

Laplace (Add-One): Add 1 to every count. Simple but distorts the distribution heavily, stealing too much mass from frequent n-grams.

Add-k Smoothing: Add a fractional k < 1 instead of 1. Less aggressive but k must be tuned.
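Laplace and add-k share one formula, differing only in the constant added. A sketch of that formula as a single function (the example counts are invented for illustration):

```python
def add_k(count_bigram, count_context, V, k=1.0):
    """Add-k estimate: (C + k) / (C_ctx + k*V). k=1 gives Laplace."""
    return (count_bigram + k) / (count_context + k * V)

# With V = 10,000 and a context seen 5,000 times, an unseen bigram
# gets probability 1/15,000 under Laplace instead of zero.
p_unseen = add_k(0, 5000, 10000)          # Laplace (k=1)
p_gentle = add_k(0, 5000, 10000, k=0.05)  # add-k steals far less mass
```

The smaller k is, the less probability mass is taken from observed n-grams, which is why add-k with a tuned fractional k typically beats plain Laplace.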

Kneser-Ney Smoothing: The gold standard for n-gram models. It subtracts a fixed discount d from each observed count and distributes the freed mass proportionally to a continuation probability: how many distinct contexts a word appears in. The word "Francisco" has high unigram frequency but almost always follows "San," so its continuation probability is low. Kneser-Ney captures this.
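The two ingredients above, absolute discounting plus a continuation-based lower-order distribution, can be combined into interpolated Kneser-Ney for bigrams. A simplified sketch (real implementations estimate $d$ from held-out data and handle higher orders recursively):

```python
from collections import Counter, defaultdict

def kneser_ney_bigram(corpus, d=0.75):
    """Interpolated Kneser-Ney for bigrams (simplified sketch)."""
    bigram, context = Counter(), Counter()
    for sent in corpus:
        tokens = ["<s>"] + sent + ["</s>"]
        for a, b in zip(tokens, tokens[1:]):
            bigram[(a, b)] += 1
            context[a] += 1
    # Continuation counts: how many distinct contexts precede each word?
    preceders = defaultdict(set)
    for a, b in bigram:
        preceders[b].add(a)
    n_bigram_types = len(bigram)
    # Number of distinct words observed after each context.
    followers = Counter(a for a, _ in bigram)

    def prob(prev, w):
        p_cont = len(preceders[w]) / n_bigram_types
        if context[prev] == 0:          # unseen context: pure continuation
            return p_cont
        discounted = max(bigram[(prev, w)] - d, 0.0) / context[prev]
        lam = d * followers[prev] / context[prev]  # mass freed by discounting
        return discounted + lam * p_cont

    return prob
```

Because the discounted mass is exactly what the interpolation weight redistributes, the probabilities for any seen context still sum to one.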

Complexity Analysis

Operation Time Space Notes
Training (counting n-grams) $O(N)$ $O(V^n)$ $N$ = corpus tokens, $V$ = vocab size, $n$ = gram order
Probability lookup $O(1)$ $O(1)$ Hash table lookup of the n-gram count
Sentence scoring $O(L)$ $O(1)$ $L$ = sentence length, one lookup per token
Perplexity computation $O(N_{\text{test}})$ $O(1)$ One pass over the test set

The storage bottleneck is the n-gram table. For a trigram model with vocabulary $V = 50{,}000$, the theoretical maximum number of trigrams is $V^3 = 1.25 \times 10^{14}$. In practice, most trigrams never occur, so sparse storage (hash maps) reduces this dramatically.

Perplexity is computed as:

$\text{PP}(W) = 2^{H(W)} = 2^{-\frac{1}{N}\sum_{i=1}^{N}\log_2 P(w_i | w_{i-n+1} \ldots w_{i-1})}$
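A quick sanity check on this formula: a model that assigns every token the uniform probability $1/V$ has perplexity exactly $V$, matching the intuition that perplexity is an effective branching factor. A minimal sketch:

```python
from math import log2

def perplexity(probs):
    """Perplexity from per-token probabilities: 2^(-mean log2 p)."""
    h = -sum(log2(p) for p in probs) / len(probs)  # cross-entropy in bits
    return 2 ** h

V = 50
# A uniform model over V words gives every token probability 1/V,
# so its perplexity is V regardless of how many tokens we score.
pp = perplexity([1 / V] * 100)
```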

Implementation

ALGORITHM TrainBigramModel(corpus)
INPUT: corpus: list of sentences (each a list of tokens)
OUTPUT: bigramProb: map of (context, word) -> probability; unigramCount: map of token -> count; V: vocabulary size

BEGIN
  unigramCount <- empty map  // missing keys read as 0
  bigramCount <- empty map   // missing keys read as 0
  V <- 0  // vocabulary size

  FOR EACH sentence IN corpus DO
    tokens <- ["<s>"] + sentence + ["</s>"]
    FOR i FROM 0 TO LENGTH(tokens) - 1 DO
      unigramCount[tokens[i]] <- unigramCount[tokens[i]] + 1
      IF i > 0 THEN
        pair <- (tokens[i-1], tokens[i])
        bigramCount[pair] <- bigramCount[pair] + 1
      END IF
    END FOR
  END FOR

  V <- NUMBER OF DISTINCT KEYS IN unigramCount

  bigramProb <- empty map
  FOR EACH ((ctx, word), count) IN bigramCount DO
    bigramProb[(ctx, word)] <- count / unigramCount[ctx]
  END FOR

  RETURN bigramProb, unigramCount, V
END

ALGORITHM ComputePerplexity(testCorpus, bigramProb, unigramCount, V)
INPUT: testCorpus: list of sentences, bigramProb: map, unigramCount: map, V: integer
OUTPUT: perplexity: float

BEGIN
  logSum <- 0
  tokenCount <- 0

  FOR EACH sentence IN testCorpus DO
    tokens <- ["<s>"] + sentence + ["</s>"]
    FOR i FROM 1 TO LENGTH(tokens) - 1 DO
      pair <- (tokens[i-1], tokens[i])
      IF pair IN bigramProb THEN
        p <- bigramProb[pair]
      ELSE
        // Laplace-style fallback for unseen bigrams. Mixing unsmoothed MLE
        // for seen bigrams with add-one mass for unseen ones is an
        // approximation: the resulting values are not a true distribution.
        ctxCount <- unigramCount[tokens[i-1]] IF tokens[i-1] IN unigramCount ELSE 0
        p <- 1 / (ctxCount + V)
      END IF
      logSum <- logSum + LOG2(p)
      tokenCount <- tokenCount + 1
    END FOR
  END FOR

  crossEntropy <- -logSum / tokenCount
  perplexity <- 2 ^ crossEntropy
  RETURN perplexity
END
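The two routines above translate almost line-for-line into Python. This sketch keeps the same simplification as the pseudocode, using the Laplace-style estimate only as a fallback for unseen bigrams:

```python
from collections import Counter
from math import log2

def train_bigram_model(corpus):
    """Count unigrams and bigrams, then build MLE bigram probabilities."""
    unigram, bigram = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence + ["</s>"]
        unigram.update(tokens)
        bigram.update(zip(tokens, tokens[1:]))
    V = len(unigram)  # vocabulary size, including <s> and </s>
    bigram_prob = {pair: c / unigram[pair[0]] for pair, c in bigram.items()}
    return bigram_prob, unigram, V

def compute_perplexity(test_corpus, bigram_prob, unigram, V):
    """Perplexity of a test corpus; unseen bigrams get a Laplace fallback."""
    log_sum, n = 0.0, 0
    for sentence in test_corpus:
        tokens = ["<s>"] + sentence + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            p = bigram_prob.get((prev, word))
            if p is None:
                # Fallback for unseen bigrams (approximate, as noted above).
                p = 1 / (unigram.get(prev, 0) + V)
            log_sum += log2(p)
            n += 1
    return 2 ** (-log_sum / n)
```

Scoring the training corpus itself should yield a low perplexity, since every bigram in it was seen during training; a held-out test set is needed for an honest evaluation.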

Real-World Applications

  • Spell checking and autocorrect: N-gram models score candidate corrections by how likely the resulting sentence is, preferring "the cat sat" over "the bat sat" when context supports it
  • Speech recognition: N-gram language models rescore acoustic model outputs, choosing the word sequence with the highest combined probability
  • Machine translation (pre-neural): Statistical MT systems like Moses used n-gram language models as a fluency component alongside translation models
  • Text prediction on keyboards: Mobile keyboards use n-gram models (often trigram) to suggest the next word as you type
  • Spam filtering: Character-level n-grams detect patterns in spam text that word-level features miss, like obfuscated words ("v1agra")
  • Authorship attribution: Character n-gram distributions serve as stylistic fingerprints, distinguishing authors by their unconscious writing patterns

Key Takeaways

  • N-gram language models estimate the probability of the next word by counting how often word sequences appear in a training corpus
  • The Markov assumption limits context to $n-1$ previous words, making the model tractable but unable to capture long-range dependencies
  • Smoothing is essential because most possible n-grams never appear in training data, and zero probabilities break the entire model
  • Kneser-Ney smoothing outperforms simpler methods by using continuation probability: how many distinct contexts a word appears in, not just its raw frequency
  • Perplexity measures model quality: a perplexity of $k$ means the model is as uncertain as if it were choosing uniformly among $k$ words at each step