Sequence Labeling: POS Tagging, NER, and the Viterbi Algorithm
How Hidden Markov Models and BIO tagging turn the problem of labeling every word in a sentence into a tractable dynamic programming problem, powering part-of-speech tagging and named entity recognition.
Terminology
| Term | Definition |
|---|---|
| Sequence Labeling | The task of assigning a categorical label to each element in an input sequence (e.g., a tag to each word in a sentence) |
| POS Tagging | Part-of-speech tagging: labeling each word with its grammatical role (noun, verb, adjective, etc.) |
| Named Entity Recognition (NER) | Identifying and classifying named entities (persons, organizations, locations, dates) in text |
| BIO Tagging | A labeling scheme where B = Beginning of entity, I = Inside (continuation), O = Outside (not an entity) |
| Hidden Markov Model (HMM) | A probabilistic model with hidden states (tags) that emit observable symbols (words), connected by transition probabilities |
| Transition Probability | $P(t_i | t_{i-1})$: the probability of moving from one tag to another in the HMM state sequence |
| Emission Probability | $P(w_i | t_i)$: the probability that a given tag generates a particular word |
| Viterbi Algorithm | A dynamic programming algorithm that finds the most probable sequence of hidden states given the observations, running in $O(T \cdot S^2)$ time |
| CRF (Conditional Random Field) | A discriminative model that directly models $P(\text{tags} | \text{words})$, often outperforming HMMs by conditioning on the full input |
What & Why
Many NLP tasks require labeling every token in a sequence. POS tagging assigns grammatical categories: "The/DET cat/NN sat/VBD." NER identifies entities: "Barack/B-PER Obama/I-PER visited/O Paris/B-LOC." These labels are the foundation for downstream tasks like information extraction, question answering, and syntactic parsing.
The challenge is that labels are not independent. In English, a determiner (DET) is almost always followed by a noun (NN) or adjective (JJ), never by another determiner. In NER, a B-PER tag is likely followed by I-PER, not by B-LOC. Sequence labeling models must capture these dependencies between adjacent labels.
Hidden Markov Models were the first successful approach: model the tag sequence as a Markov chain with transition probabilities, and model each word as an emission from its tag. The Viterbi algorithm then finds the globally optimal tag sequence in polynomial time using dynamic programming. While neural models (BiLSTM-CRF, BERT fine-tuning) have largely replaced HMMs in practice, the Viterbi algorithm remains the decoding backbone of modern CRF layers.
How It Works
The BIO Tagging Scheme
BIO converts entity spans into per-token labels:
- B-TYPE: first token of an entity of type TYPE
- I-TYPE: continuation token of the same entity
- O: not part of any entity
Example: "Barack Obama visited Paris" becomes:
Barack/B-PER Obama/I-PER visited/O Paris/B-LOC
This scheme handles adjacent entities of the same type by requiring a new B tag at each entity boundary.
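The conversion from spans to BIO labels can be sketched in a few lines. This is a minimal illustration, assuming entities are given as `(start, end, type)` token spans with an exclusive end index (that representation is an assumption, not something fixed by the scheme itself):

```python
def spans_to_bio(tokens, spans):
    """Convert entity spans to per-token BIO labels.

    spans: list of (start, end, type) with end exclusive,
    e.g. (0, 2, "PER") covers tokens 0 and 1.
    """
    labels = ["O"] * len(tokens)          # default: outside any entity
    for start, end, etype in spans:
        labels[start] = f"B-{etype}"      # entity boundary gets B
        for i in range(start + 1, end):
            labels[i] = f"I-{etype}"      # continuation tokens get I
    return labels

tokens = ["Barack", "Obama", "visited", "Paris"]
spans = [(0, 2, "PER"), (3, 4, "LOC")]
print(spans_to_bio(tokens, spans))
# ['B-PER', 'I-PER', 'O', 'B-LOC']
```

Because every entity starts with a B tag, two back-to-back entities of the same type remain distinguishable in the label sequence.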
Hidden Markov Model
An HMM for POS tagging has:
- Hidden states: the tag set (DET, NN, VB, JJ, ...)
- Observations: the words
- Transition matrix $A$: $A[i][j] = P(t_j | t_i)$
- Emission matrix $B$: $B[t][w] = P(w | t)$
- Initial distribution $\pi$: $\pi[t] = P(t_1 = t)$
Given a sentence $w_1, w_2, \ldots, w_T$, the goal is to find the tag sequence $t_1, t_2, \ldots, t_T$ that maximizes the joint probability:

$$\hat{t}_{1:T} = \operatorname*{arg\,max}_{t_{1:T}} \; \pi[t_1] \, B[t_1][w_1] \prod_{i=2}^{T} A[t_{i-1}][t_i] \, B[t_i][w_i]$$
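The objective can be made concrete by scoring one candidate tag sequence under a toy model. The probabilities below are made-up illustrative numbers (an assumption for the sketch), and the product is computed in log space, as real implementations do to avoid underflow:

```python
import math

# Toy HMM with made-up probabilities (assumption: illustrative only).
pi = {"DET": 0.6, "NN": 0.3, "VB": 0.1}
A = {"DET": {"DET": 0.05, "NN": 0.9, "VB": 0.05},
     "NN":  {"DET": 0.1,  "NN": 0.2, "VB": 0.7},
     "VB":  {"DET": 0.5,  "NN": 0.4, "VB": 0.1}}
B = {"DET": {"the": 0.7},
     "NN":  {"cat": 0.1},
     "VB":  {"sat": 0.05}}

def score(words, tags):
    """Log probability of one (words, tags) pair under the HMM:
    log pi[t1] + log B[t1][w1] + sum of log A + log B terms."""
    logp = math.log(pi[tags[0]]) + math.log(B[tags[0]][words[0]])
    for i in range(1, len(words)):
        logp += math.log(A[tags[i - 1]][tags[i]])
        logp += math.log(B[tags[i]][words[i]])
    return logp

print(score(["the", "cat", "sat"], ["DET", "NN", "VB"]))  # ≈ -6.63
```

Viterbi's job is to find the tag sequence with the highest such score without enumerating all $S^T$ candidates.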
The Viterbi Algorithm
Brute-force search over all possible tag sequences is $O(S^T)$, where $S$ is the number of tags. Viterbi reduces this to $O(T \cdot S^2)$ using dynamic programming. At each position $t$, for each tag $s$, it stores the probability of the best path ending in state $s$ at time $t$, along with a backpointer to the previous state.
Complexity Analysis
| Operation | Time | Space | Notes |
|---|---|---|---|
| Viterbi decoding | $O(T \cdot S^2)$ | $O(T \cdot S)$ | $T$ = sequence length, $S$ = number of tags |
| HMM training (MLE) | $O(N)$ | $O(S^2 + S \cdot V)$ | $N$ = corpus tokens, count transitions and emissions |
| Brute-force decoding | $O(S^T)$ | $O(T)$ | Exponential, infeasible for real sentences |
| CRF training (per epoch) | $O(N \cdot S^2)$ | $O(S^2 + \text{features})$ | Forward-backward pass for each training example |
For POS tagging with $S \approx 45$ tags (Penn Treebank tagset) and average sentence length $T \approx 25$, Viterbi performs roughly $25 \times 45^2 \approx 50{,}000$ operations per sentence, which is trivially fast.
Implementation
ALGORITHM Viterbi(words, tagSet, A, B, pi)
INPUT: words: list of T observed words
tagSet: list of S possible tags
A: transition matrix [S x S], A[i][j] = P(tag_j | tag_i)
B: emission matrix [S x V], B[s][w] = P(word_w | tag_s)
pi: initial distribution [S], pi[s] = P(first tag = s)
OUTPUT: bestPath: list of T tags (most probable tag sequence)
BEGIN
T <- LENGTH(words)
S <- LENGTH(tagSet)
// dp[t][s] = log probability of best path ending in state s at time t
dp <- CREATE 2D array [T x S], initialized to -INFINITY
backptr <- CREATE 2D array [T x S]
// Initialization (t = 0)
FOR s FROM 0 TO S - 1 DO
dp[0][s] <- LOG(pi[s]) + LOG(B[s][words[0]])
backptr[0][s] <- -1
END FOR
// Recursion (t = 1 to T-1)
FOR t FROM 1 TO T - 1 DO
FOR s FROM 0 TO S - 1 DO
bestPrev <- -1
bestScore <- -INFINITY
FOR sPrev FROM 0 TO S - 1 DO
score <- dp[t-1][sPrev] + LOG(A[sPrev][s]) + LOG(B[s][words[t]])
IF score > bestScore THEN
bestScore <- score
bestPrev <- sPrev
END IF
END FOR
dp[t][s] <- bestScore
backptr[t][s] <- bestPrev
END FOR
END FOR
// Termination: find best final state
bestFinal <- ARGMAX over s of dp[T-1][s]
// Backtrace
bestPath <- CREATE list of length T
bestPath[T-1] <- tagSet[bestFinal]
current <- bestFinal
FOR t FROM T - 2 DOWNTO 0 DO
current <- backptr[t+1][current]
bestPath[t] <- tagSet[current]
END FOR
RETURN bestPath
END
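The pseudocode above translates almost line for line into Python. This sketch works in log space and maps a zero probability to $-\infty$ (a detail the pseudocode glosses over, since `LOG(0)` is undefined); emission tables are assumed to be per-tag dicts so unseen words score zero:

```python
import math

NEG_INF = float("-inf")

def safe_log(p):
    """log(p), with log(0) mapped to -inf instead of raising."""
    return math.log(p) if p > 0 else NEG_INF

def viterbi(words, tag_set, A, B, pi):
    """Most probable tag sequence under an HMM (log-space Viterbi).

    A[i][j] = P(tag_j | tag_i), B[s] is a dict word -> P(word | tag_s),
    pi[s] = P(first tag = s), all indexed like the pseudocode above.
    """
    T, S = len(words), len(tag_set)
    dp = [[NEG_INF] * S for _ in range(T)]
    backptr = [[-1] * S for _ in range(T)]

    # Initialization (t = 0)
    for s in range(S):
        dp[0][s] = safe_log(pi[s]) + safe_log(B[s].get(words[0], 0.0))

    # Recursion: best predecessor for each (t, s)
    for t in range(1, T):
        for s in range(S):
            emit = safe_log(B[s].get(words[t], 0.0))
            for s_prev in range(S):
                score = dp[t - 1][s_prev] + safe_log(A[s_prev][s]) + emit
                if score > dp[t][s]:
                    dp[t][s] = score
                    backptr[t][s] = s_prev

    # Termination and backtrace
    best = max(range(S), key=lambda s: dp[T - 1][s])
    path = [best]
    for t in range(T - 1, 0, -1):
        path.append(backptr[t][path[-1]])
    path.reverse()
    return [tag_set[s] for s in path]

# Toy model (made-up numbers, an assumption for illustration)
tag_set = ["DET", "NN", "VB"]
pi = [0.6, 0.3, 0.1]
A = [[0.05, 0.9, 0.05],
     [0.1,  0.2, 0.7],
     [0.5,  0.4, 0.1]]
B = [{"the": 0.7}, {"cat": 0.1}, {"sat": 0.05}]
print(viterbi(["the", "cat", "sat"], tag_set, A, B, pi))
# ['DET', 'NN', 'VB']
```

The triple loop makes the $O(T \cdot S^2)$ cost visible directly: one pass over positions, and an $S \times S$ sweep at each.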
ALGORITHM TrainHMM(taggedCorpus, tagSet, vocab)
INPUT: taggedCorpus: list of (word, tag) pairs grouped by sentence
tagSet: list of S tags, vocab: list of V words
OUTPUT: A: transition matrix, B: emission matrix, pi: initial distribution
BEGIN
transCount <- CREATE 2D array [S x S], all zeros
emitCount <- CREATE 2D array [S x V], all zeros
tagCount <- CREATE array [S], all zeros
initCount <- CREATE array [S], all zeros
FOR EACH sentence IN taggedCorpus DO
initCount[INDEX(sentence[0].tag)] <- initCount[INDEX(sentence[0].tag)] + 1
FOR i FROM 0 TO LENGTH(sentence) - 1 DO
t <- INDEX(sentence[i].tag)
w <- INDEX(sentence[i].word)
tagCount[t] <- tagCount[t] + 1
emitCount[t][w] <- emitCount[t][w] + 1
IF i > 0 THEN
tPrev <- INDEX(sentence[i-1].tag)
transCount[tPrev][t] <- transCount[tPrev][t] + 1
END IF
END FOR
END FOR
// Normalize with add-one smoothing
pi <- initCount / SUM(initCount)
FOR s FROM 0 TO S - 1 DO
A[s] <- (transCount[s] + 1) / (SUM(transCount[s]) + S)
B[s] <- (emitCount[s] + 1) / (tagCount[s] + V)
END FOR
RETURN A, B, pi
END
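A runnable version of the training procedure is a short counting loop. This sketch mirrors the pseudocode, with add-one smoothing on the transition and emission matrices (and, like the pseudocode, no smoothing on $\pi$); the corpus format of per-sentence `(word, tag)` pairs is taken from the INPUT description:

```python
def train_hmm(tagged_corpus, tag_set, vocab):
    """MLE training of an HMM with add-one smoothing on A and B.

    tagged_corpus: list of sentences, each a list of (word, tag) pairs.
    Returns (A, B, pi) as nested lists indexed by tag/word positions.
    """
    S, V = len(tag_set), len(vocab)
    tag_idx = {t: i for i, t in enumerate(tag_set)}
    word_idx = {w: i for i, w in enumerate(vocab)}

    trans = [[0] * S for _ in range(S)]   # transition counts
    emit = [[0] * V for _ in range(S)]    # emission counts
    tag_count = [0] * S
    init_count = [0] * S

    for sentence in tagged_corpus:
        init_count[tag_idx[sentence[0][1]]] += 1
        prev = None
        for word, tag in sentence:
            t, w = tag_idx[tag], word_idx[word]
            tag_count[t] += 1
            emit[t][w] += 1
            if prev is not None:
                trans[prev][t] += 1
            prev = t

    total_sents = sum(init_count)
    pi = [c / total_sents for c in init_count]
    A = [[(trans[s][j] + 1) / (sum(trans[s]) + S) for j in range(S)]
         for s in range(S)]
    B = [[(emit[s][w] + 1) / (tag_count[s] + V) for w in range(V)]
         for s in range(S)]
    return A, B, pi

corpus = [[("the", "DET"), ("cat", "NN")],
          [("the", "DET"), ("dog", "NN")]]
A, B, pi = train_hmm(corpus, ["DET", "NN"], ["the", "cat", "dog"])
```

On this tiny corpus, $A[\text{DET}][\text{NN}] = (2+1)/(2+2) = 0.75$: both observed DET transitions go to NN, but smoothing reserves mass for the unseen DET-to-DET transition.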
Real-World Applications
- Information extraction: NER identifies people, organizations, locations, and dates in news articles, legal documents, and medical records for structured database population
- Search engines: POS tagging helps disambiguate queries ("book a flight" vs "read a book") and improves query understanding
- Machine translation: POS tags guide word reordering between languages with different syntactic structures (e.g., English SVO vs Japanese SOV)
- Clinical NLP: NER extracts drug names, dosages, and conditions from clinical notes, powering pharmacovigilance and clinical decision support
- Financial NLP: Entity extraction identifies company names, ticker symbols, and monetary amounts in earnings reports and SEC filings
- Voice assistants: Slot filling (a form of sequence labeling) extracts structured parameters from user utterances: "Book a [flight] from [New York] to [London] on [Friday]"
Key Takeaways
- Sequence labeling assigns a tag to every token in a sequence, with POS tagging and NER being the two most common applications
- BIO tagging converts entity span detection into a per-token classification problem, handling adjacent entities cleanly
- HMMs model tag sequences as Markov chains with emission probabilities, capturing the dependency between adjacent labels
- The Viterbi algorithm finds the globally optimal tag sequence in $O(T \cdot S^2)$ time using dynamic programming, avoiding the exponential cost of brute-force search
- Modern systems use BiLSTM-CRF or fine-tuned transformers, but the CRF decoding layer still uses Viterbi internally