Sequence Labeling: POS Tagging, NER, and the Viterbi Algorithm
How Hidden Markov Models and BIO tagging turn the problem of labeling every word in a sentence into a tractable dynamic programming problem, powering part-of-speech tagging and named entity recognition.
Terminology
| Term | Definition |
|---|---|
| Sequence Labeling | The task of assigning a categorical label to each element in an input sequence (e.g., a tag to each word in a sentence) |
| POS Tagging | Part-of-speech tagging: labeling each word with its grammatical role (noun, verb, adjective, etc.) |
| Named Entity Recognition (NER) | Identifying and classifying named entities (persons, organizations, locations, dates) in text |
| BIO Tagging | A labeling scheme where B = Beginning of entity, I = Inside (continuation), O = Outside (not an entity) |
| Hidden Markov Model (HMM) | A probabilistic model with hidden states (tags) that emit observable symbols (words), connected by transition probabilities |
| Transition Probability | $P(t_i | t_{i-1})$: the probability of moving from one tag to another in the HMM state sequence |
| Emission Probability | $P(w_i | t_i)$: the probability that a given tag generates a particular word |
| Viterbi Algorithm | A dynamic programming algorithm that finds the most probable sequence of hidden states given the observations, running in $O(T \cdot S^2)$ time |
| CRF (Conditional Random Field) | A discriminative model that directly models $P(\text{tags} | \text{words})$, often outperforming HMMs by conditioning on the full input |
What & Why
Many NLP tasks require labeling every token in a sequence. POS tagging assigns grammatical categories: "The/DET cat/NN sat/VBD." NER identifies entities: "Barack/B-PER Obama/I-PER visited/O Paris/B-LOC." These labels are the foundation for downstream tasks like information extraction, question answering, and syntactic parsing.
The challenge is that labels are not independent. In English, a determiner (DET) is almost always followed by a noun (NN) or adjective (JJ), never by another determiner. In NER, a B-PER tag is likely followed by I-PER, not by B-LOC. Sequence labeling models must capture these dependencies between adjacent labels.
Hidden Markov Models were the first successful approach: model the tag sequence as a Markov chain with transition probabilities, and model each word as an emission from its tag. The Viterbi algorithm then finds the globally optimal tag sequence in polynomial time using dynamic programming. While neural models (BiLSTM-CRF, BERT fine-tuning) have largely replaced HMMs in practice, the Viterbi algorithm remains the decoding backbone of modern CRF layers.
How It Works
The BIO Tagging Scheme
BIO converts entity spans into per-token labels:
- B-TYPE: first token of an entity of type TYPE
- I-TYPE: continuation token of the same entity
- O: not part of any entity
Example: "Barack Obama visited Paris" becomes:
Barack/B-PER Obama/I-PER visited/O Paris/B-LOC
This scheme handles adjacent entities of the same type by requiring a new B tag at each entity boundary.
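The conversion from spans to BIO labels can be sketched in a few lines. This is a minimal illustration, assuming entities are given as `(start, end, type)` token spans with an exclusive end index (that representation is an assumption, not something fixed by the scheme itself):

```python
def spans_to_bio(tokens, spans):
    """Convert entity spans to per-token BIO labels.

    spans: list of (start, end, type) with end exclusive,
    e.g. (0, 2, "PER") covers tokens 0 and 1.
    """
    labels = ["O"] * len(tokens)          # default: outside any entity
    for start, end, etype in spans:
        labels[start] = f"B-{etype}"      # entity boundary gets B
        for i in range(start + 1, end):
            labels[i] = f"I-{etype}"      # continuation tokens get I
    return labels

tokens = ["Barack", "Obama", "visited", "Paris"]
spans = [(0, 2, "PER"), (3, 4, "LOC")]
print(spans_to_bio(tokens, spans))
# ['B-PER', 'I-PER', 'O', 'B-LOC']
```

Because every entity starts with a B tag, two back-to-back entities of the same type remain distinguishable in the label sequence.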
Hidden Markov Model
An HMM for POS tagging has:
- Hidden states: the tag set (DET, NN, VB, JJ, ...)
- Observations: the words
- Transition matrix $A$: $A[i][j] = P(t_j | t_i)$
- Emission matrix $B$: $B[t][w] = P(w | t)$
- Initial distribution $\pi$: $\pi[t] = P(t_1 = t)$
Given a sentence $w_1, w_2, \ldots, w_T$, the goal is to find the tag sequence $t_1, t_2, \ldots, t_T$ that maximizes the joint probability:

$$\hat{t}_{1:T} = \operatorname*{arg\,max}_{t_{1:T}} \; \pi[t_1] \, B[t_1][w_1] \prod_{i=2}^{T} A[t_{i-1}][t_i] \, B[t_i][w_i]$$
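The objective can be made concrete by scoring one candidate tag sequence under a toy model. The probabilities below are made-up illustrative numbers (an assumption for the sketch), and the product is computed in log space, as real implementations do to avoid underflow:

```python
import math

# Toy HMM with made-up probabilities (assumption: illustrative only).
pi = {"DET": 0.6, "NN": 0.3, "VB": 0.1}
A = {"DET": {"DET": 0.05, "NN": 0.9, "VB": 0.05},
     "NN":  {"DET": 0.1,  "NN": 0.2, "VB": 0.7},
     "VB":  {"DET": 0.5,  "NN": 0.4, "VB": 0.1}}
B = {"DET": {"the": 0.7},
     "NN":  {"cat": 0.1},
     "VB":  {"sat": 0.05}}

def score(words, tags):
    """Log probability of one (words, tags) pair under the HMM:
    log pi[t1] + log B[t1][w1] + sum of log A + log B terms."""
    logp = math.log(pi[tags[0]]) + math.log(B[tags[0]][words[0]])
    for i in range(1, len(words)):
        logp += math.log(A[tags[i - 1]][tags[i]])
        logp += math.log(B[tags[i]][words[i]])
    return logp

print(score(["the", "cat", "sat"], ["DET", "NN", "VB"]))  # ≈ -6.63
```

Viterbi's job is to find the tag sequence with the highest such score without enumerating all $S^T$ candidates.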
The Viterbi Algorithm
Brute-force search over all possible tag sequences is $O(S^T)$, where $S$ is the number of tags. Viterbi reduces this to $O(T \cdot S^2)$ using dynamic programming. At each position $t$, for each tag $s$, it stores the probability of the best path ending in state $s$ at time $t$, along with a backpointer to the previous state.
Complexity Analysis
| Operation | Time | Space | Notes |
|---|---|---|---|
| Viterbi decoding | $O(T \cdot S^2)$ | $O(T \cdot S)$ | $T$ = sequence length, $S$ = number of tags |
| HMM training (MLE) | $O(N)$ | $O(S^2 + S \cdot V)$ | $N$ = corpus tokens, count transitions and emissions |
| Brute-force decoding | $O(S^T)$ | $O(T)$ | Exponential, infeasible for real sentences |
| CRF training (per epoch) | $O(N \cdot S^2)$ | $O(S^2 + \text{features})$ | Forward-backward pass for each training example |
For POS tagging with $S \approx 45$ tags (Penn Treebank tagset) and average sentence length $T \approx 25$, Viterbi performs roughly $25 \times 45^2 \approx 50{,}000$ operations per sentence, which is trivially fast.
Implementation
ALGORITHM Viterbi(words, tagSet, A, B, pi)
INPUT: words: list of T observed words
tagSet: list of S possible tags
A: transition matrix [S x S], A[i][j] = P(tag_j | tag_i)
B: emission matrix [S x V], B[s][w] = P(word_w | tag_s)
pi: initial distribution [S], pi[s] = P(first tag = s)
OUTPUT: bestPath: list of T tags (most probable tag sequence)
BEGIN
T <- LENGTH(words)
S <- LENGTH(tagSet)
// dp[t][s] = log probability of best path ending in state s at time t
dp <- CREATE 2D array [T x S], initialized to -INFINITY
backptr <- CREATE 2D array [T x S]
// Initialization (t = 0)
FOR s FROM 0 TO S - 1 DO
dp[0][s] <- LOG(pi[s]) + LOG(B[s][words[0]])
backptr[0][s] <- -1
END FOR
// Recursion (t = 1 to T-1)
FOR t FROM 1 TO T - 1 DO
FOR s FROM 0 TO S - 1 DO
bestPrev <- -1
bestScore <- -INFINITY
FOR sPrev FROM 0 TO S - 1 DO
score <- dp[t-1][sPrev] + LOG(A[sPrev][s]) + LOG(B[s][words[t]])
IF score > bestScore THEN
bestScore <- score
bestPrev <- sPrev
END IF
END FOR
dp[t][s] <- bestScore
backptr[t][s] <- bestPrev
END FOR
END FOR
// Termination: find best final state
bestFinal <- ARGMAX over s of dp[T-1][s]
// Backtrace
bestPath <- CREATE list of length T
bestPath[T-1] <- tagSet[bestFinal]
current <- bestFinal
FOR t FROM T - 2 DOWNTO 0 DO
current <- backptr[t+1][current]
bestPath[t] <- tagSet[current]
END FOR
RETURN bestPath
END
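The pseudocode above translates almost line for line into Python. This sketch works in log space and maps a zero probability to $-\infty$ (a detail the pseudocode glosses over, since `LOG(0)` is undefined); emission tables are assumed to be per-tag dicts so unseen words score zero:

```python
import math

NEG_INF = float("-inf")

def safe_log(p):
    """log(p), with log(0) mapped to -inf instead of raising."""
    return math.log(p) if p > 0 else NEG_INF

def viterbi(words, tag_set, A, B, pi):
    """Most probable tag sequence under an HMM (log-space Viterbi).

    A[i][j] = P(tag_j | tag_i), B[s] is a dict word -> P(word | tag_s),
    pi[s] = P(first tag = s), all indexed like the pseudocode above.
    """
    T, S = len(words), len(tag_set)
    dp = [[NEG_INF] * S for _ in range(T)]
    backptr = [[-1] * S for _ in range(T)]

    # Initialization (t = 0)
    for s in range(S):
        dp[0][s] = safe_log(pi[s]) + safe_log(B[s].get(words[0], 0.0))

    # Recursion: best predecessor for each (t, s)
    for t in range(1, T):
        for s in range(S):
            emit = safe_log(B[s].get(words[t], 0.0))
            for s_prev in range(S):
                score = dp[t - 1][s_prev] + safe_log(A[s_prev][s]) + emit
                if score > dp[t][s]:
                    dp[t][s] = score
                    backptr[t][s] = s_prev

    # Termination and backtrace
    best = max(range(S), key=lambda s: dp[T - 1][s])
    path = [best]
    for t in range(T - 1, 0, -1):
        path.append(backptr[t][path[-1]])
    path.reverse()
    return [tag_set[s] for s in path]

# Toy model (made-up numbers, an assumption for illustration)
tag_set = ["DET", "NN", "VB"]
pi = [0.6, 0.3, 0.1]
A = [[0.05, 0.9, 0.05],
     [0.1,  0.2, 0.7],
     [0.5,  0.4, 0.1]]
B = [{"the": 0.7}, {"cat": 0.1}, {"sat": 0.05}]
print(viterbi(["the", "cat", "sat"], tag_set, A, B, pi))
# ['DET', 'NN', 'VB']
```

The triple loop makes the $O(T \cdot S^2)$ cost visible directly: one pass over positions, and an $S \times S$ sweep at each.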
ALGORITHM TrainHMM(taggedCorpus, tagSet, vocab)
INPUT: taggedCorpus: list of (word, tag) pairs grouped by sentence
tagSet: list of S tags, vocab: list of V words
OUTPUT: A: transition matrix, B: emission matrix, pi: initial distribution
BEGIN
transCount <- CREATE 2D array [S x S], all zeros
emitCount <- CREATE 2D array [S x V], all zeros
tagCount <- CREATE array [S], all zeros
initCount <- CREATE array [S], all zeros
FOR EACH sentence IN taggedCorpus DO
initCount[INDEX(sentence[0].tag)] <- initCount[INDEX(sentence[0].tag)] + 1
FOR i FROM 0 TO LENGTH(sentence) - 1 DO
t <- INDEX(sentence[i].tag)
w <- INDEX(sentence[i].word)
tagCount[t] <- tagCount[t] + 1
emitCount[t][w] <- emitCount[t][w] + 1
IF i > 0 THEN
tPrev <- INDEX(sentence[i-1].tag)
transCount[tPrev][t] <- transCount[tPrev][t] + 1
END IF
END FOR
END FOR
// Normalize with add-one smoothing
pi <- initCount / SUM(initCount)
FOR s FROM 0 TO S - 1 DO
A[s] <- (transCount[s] + 1) / (SUM(transCount[s]) + S)
B[s] <- (emitCount[s] + 1) / (tagCount[s] + V)
END FOR
RETURN A, B, pi
END
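A runnable version of the training procedure is a short counting loop. This sketch mirrors the pseudocode, with add-one smoothing on the transition and emission matrices (and, like the pseudocode, no smoothing on $\pi$); the corpus format of per-sentence `(word, tag)` pairs is taken from the INPUT description:

```python
def train_hmm(tagged_corpus, tag_set, vocab):
    """MLE training of an HMM with add-one smoothing on A and B.

    tagged_corpus: list of sentences, each a list of (word, tag) pairs.
    Returns (A, B, pi) as nested lists indexed by tag/word positions.
    """
    S, V = len(tag_set), len(vocab)
    tag_idx = {t: i for i, t in enumerate(tag_set)}
    word_idx = {w: i for i, w in enumerate(vocab)}

    trans = [[0] * S for _ in range(S)]   # transition counts
    emit = [[0] * V for _ in range(S)]    # emission counts
    tag_count = [0] * S
    init_count = [0] * S

    for sentence in tagged_corpus:
        init_count[tag_idx[sentence[0][1]]] += 1
        prev = None
        for word, tag in sentence:
            t, w = tag_idx[tag], word_idx[word]
            tag_count[t] += 1
            emit[t][w] += 1
            if prev is not None:
                trans[prev][t] += 1
            prev = t

    total_sents = sum(init_count)
    pi = [c / total_sents for c in init_count]
    A = [[(trans[s][j] + 1) / (sum(trans[s]) + S) for j in range(S)]
         for s in range(S)]
    B = [[(emit[s][w] + 1) / (tag_count[s] + V) for w in range(V)]
         for s in range(S)]
    return A, B, pi

corpus = [[("the", "DET"), ("cat", "NN")],
          [("the", "DET"), ("dog", "NN")]]
A, B, pi = train_hmm(corpus, ["DET", "NN"], ["the", "cat", "dog"])
```

On this tiny corpus, $A[\text{DET}][\text{NN}] = (2+1)/(2+2) = 0.75$: both observed DET transitions go to NN, but smoothing reserves mass for the unseen DET-to-DET transition.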
Real-World Applications
- Information extraction: NER identifies people, organizations, locations, and dates in news articles, legal documents, and medical records for structured database population
- Search engines: POS tagging helps disambiguate queries ("book a flight" vs "read a book") and improves query understanding
- Machine translation: POS tags guide word reordering between languages with different syntactic structures (e.g., English SVO vs Japanese SOV)
- Clinical NLP: NER extracts drug names, dosages, and conditions from clinical notes, powering pharmacovigilance and clinical decision support
- Financial NLP: Entity extraction identifies company names, ticker symbols, and monetary amounts in earnings reports and SEC filings
- Voice assistants: Slot filling (a form of sequence labeling) extracts structured parameters from user utterances: "Book a [flight] from [New York] to [London] on [Friday]"
Key Takeaways
- Sequence labeling assigns a tag to every token in a sequence, with POS tagging and NER being the two most common applications
- BIO tagging converts entity span detection into a per-token classification problem, handling adjacent entities cleanly
- HMMs model tag sequences as Markov chains with emission probabilities, capturing the dependency between adjacent labels
- The Viterbi algorithm finds the globally optimal tag sequence in $O(T \cdot S^2)$ time using dynamic programming, avoiding the exponential cost of brute-force search
- Modern systems use BiLSTM-CRF or fine-tuned transformers, but the CRF decoding layer still uses Viterbi internally