Evaluation Metrics for NLP: Why Measuring Language Quality Is Hard
From BLEU and ROUGE to METEOR and perplexity, exploring the automatic metrics that score NLP systems, their blind spots, and why human evaluation remains the gold standard.
Terminology
| Term | Definition |
|---|---|
| BLEU | Bilingual Evaluation Understudy: measures n-gram precision between machine output and reference translations, with a brevity penalty |
| ROUGE | Recall-Oriented Understudy for Gisting Evaluation: measures n-gram recall between generated summaries and reference summaries |
| METEOR | Metric for Evaluation of Translation with Explicit ORdering: extends BLEU with stemming, synonyms, and word order penalties |
| Perplexity | $2^{H}$ where $H$ is cross-entropy: measures how "surprised" a language model is by test data. Lower is better |
| Precision | Fraction of generated n-grams that appear in the reference: how much of the output is correct |
| Recall | Fraction of reference n-grams that appear in the output: how much of the reference is captured |
| F1 Score | Harmonic mean of precision and recall: $F_1 = \frac{2 \cdot P \cdot R}{P + R}$, balancing both concerns |
| BERTScore | A metric that computes token-level cosine similarity between BERT embeddings of the candidate and reference, capturing semantic similarity beyond exact n-gram matches |
| Human Evaluation | Having human judges rate outputs on dimensions like fluency, adequacy, and coherence; the gold standard but expensive and slow |
What & Why
Evaluating generated text is fundamentally harder than evaluating classification or regression. There is no single correct answer. "The cat is on the mat" and "A feline rests atop the rug" convey the same meaning with zero word overlap. Any metric based on exact string matching will score the second translation as zero, which is clearly wrong.
This is why NLP has developed a family of automatic metrics, each with different strengths and blind spots. BLEU measures precision (how much of the output matches the reference). ROUGE measures recall (how much of the reference is captured). METEOR adds stemming and synonym matching. BERTScore uses neural embeddings to capture semantic similarity. None of them fully correlate with human judgment, but together they provide useful signals.
Understanding these metrics matters because they drive model development. Researchers optimize for BLEU in translation, ROUGE in summarization, and perplexity in language modeling. If the metric is flawed, the optimization target is flawed, and the resulting models have predictable failure modes. Knowing what each metric measures (and what it misses) is essential for interpreting NLP research and building production systems.
How It Works
BLEU (Precision-Focused)
BLEU computes modified n-gram precision $p_n$ for $n = 1, 2, 3, 4$, takes their geometric mean, and applies a brevity penalty:

$$\text{BLEU} = \text{BP} \cdot \exp\left(\sum_{n=1}^{4} w_n \log p_n\right), \qquad \text{BP} = \begin{cases} 1 & \text{if } L > R \\ e^{1 - R/L} & \text{if } L \le R \end{cases}$$

with uniform weights $w_n = 1/4$, where $L$ is the output length and $R$ the reference length.
"Modified" precision clips each n-gram count by its maximum count in any reference, preventing a degenerate output like "the the the the" from scoring high against a reference containing "the."
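A minimal single-reference sketch in Python makes the clipping and the brevity penalty concrete. Real BLEU implementations (e.g., sacrebleu) add smoothing and multi-reference support; this toy version omits both:

```python
from collections import Counter
from math import exp, log

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram's count is
    capped at its count in the reference."""
    cand = Counter(tuple(candidate[i:i+n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i+n]) for i in range(len(reference) - n + 1))
    clipped = sum(min(c, ref[g]) for g, c in cand.items())
    total = sum(cand.values())
    return clipped / total if total else 0.0

def bleu(candidate, reference, max_n=4):
    """Toy single-reference BLEU: geometric mean of modified n-gram
    precisions for n = 1..max_n, times the brevity penalty."""
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:          # geometric mean collapses to 0
        return 0.0
    geo_mean = exp(sum(log(p) for p in precisions) / max_n)
    bp = 1.0 if len(candidate) > len(reference) else \
         exp(1 - len(reference) / max(len(candidate), 1))
    return bp * geo_mean
```

With this version, the degenerate output "the the the the" scores 0 against "the cat is on the mat": its clipped unigram precision is 2/4, and its bigram precision is 0, which zeroes the geometric mean.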
ROUGE (Recall-Focused)
ROUGE-N measures n-gram recall: what fraction of reference n-grams appear in the output.
ROUGE-L uses the longest common subsequence (LCS) instead of fixed n-grams, capturing word order without requiring contiguous matches.
METEOR
METEOR improves on BLEU by:
- Matching words via exact match, stemming ("running" matches "ran"), and synonym lookup (WordNet)
- Computing an alignment between candidate and reference
- Penalizing fragmented matches (words matched but in wrong order)
METEOR correlates better with human judgment than BLEU, especially at the sentence level.
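The scoring side of METEOR can be sketched without the matching machinery. The version below uses exact unigram matches only (no stemming or WordNet) and a greedy first-match alignment, where real METEOR chooses the alignment that minimizes the number of chunks; the recall-weighted F-mean and the fragmentation penalty constants follow the original METEOR formulation:

```python
def meteor_exact(candidate, reference):
    """Simplified METEOR-style score: exact unigram matches, greedy
    one-to-one alignment, recall-weighted F-mean, fragmentation penalty."""
    pairs = []                 # (candidate index, reference index) matches
    used = set()
    for i, tok in enumerate(candidate):
        for j, ref_tok in enumerate(reference):
            if j not in used and tok == ref_tok:
                pairs.append((i, j))
                used.add(j)
                break
    m = len(pairs)
    if m == 0:
        return 0.0
    precision = m / len(candidate)
    recall = m / len(reference)
    f_mean = 10 * precision * recall / (recall + 9 * precision)
    # A chunk is a maximal run of matches contiguous in BOTH strings;
    # fewer chunks means the matched words appear in the same order.
    chunks = 1 + sum(1 for (i1, j1), (i2, j2) in zip(pairs, pairs[1:])
                     if not (i2 == i1 + 1 and j2 == j1 + 1))
    penalty = 0.5 * (chunks / m) ** 3
    return f_mean * (1 - penalty)
```

Note that even an identical sentence pair scores slightly below 1.0, since the penalty term is nonzero whenever there is at least one chunk.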
Perplexity
Perplexity measures how well a language model predicts held-out text:

$$\text{PPL} = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log p(w_i \mid w_1, \ldots, w_{i-1})\right)$$

which equals $2^H$ when the cross-entropy $H$ is computed with base-2 logarithms, as in the terminology table.
A perplexity of 20 means the model is, on average, as uncertain as choosing uniformly among 20 words. Lower is better. Perplexity is used to evaluate language models themselves, not their generated outputs.
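The "uniform over 20 words" intuition is easy to verify numerically. Given the per-token log-probabilities a model assigned to held-out text, perplexity is just the exponentiated average negative log-probability:

```python
import math

def perplexity(log_probs):
    """Perplexity from per-token natural-log probabilities:
    exp of the average negative log-probability."""
    return math.exp(-sum(log_probs) / len(log_probs))

# A model that assigns every token probability 1/20 has perplexity 20.
uniform_20 = [math.log(1 / 20)] * 100
print(perplexity(uniform_20))  # ≈ 20.0
```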
BERTScore
BERTScore computes token-level cosine similarity between BERT embeddings of the candidate and reference tokens, then takes the maximum similarity for each token (greedy matching). This captures semantic equivalence: "feline" and "cat" have high embedding similarity even though they share no characters.
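The greedy-matching step can be sketched on precomputed token embeddings. This toy version takes embedding matrices as input (one row per token); real BERTScore extracts contextual embeddings from a specific BERT layer and optionally applies IDF weighting:

```python
import numpy as np

def bertscore_f1(cand_emb, ref_emb):
    """BERTScore-style greedy matching over token embedding matrices
    (rows = tokens). Returns the F1 of greedy precision and recall."""
    # Normalize rows so dot products are cosine similarities.
    c = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    r = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = c @ r.T                       # |cand| x |ref| cosine matrix
    precision = sim.max(axis=1).mean()  # best reference match per candidate token
    recall = sim.max(axis=0).mean()     # best candidate match per reference token
    return 2 * precision * recall / (precision + recall)

# Identical embeddings score a perfect 1.0:
e = np.array([[1.0, 0.0], [0.0, 1.0]])
print(bertscore_f1(e, e))  # → 1.0
```

Because the similarity is computed in embedding space, two tokens like "feline" and "cat" can contribute a near-1 match even with zero character overlap.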
Complexity Analysis
| Metric | Time | Space | Notes |
|---|---|---|---|
| BLEU | $O(L \cdot R)$ | $O(L)$ | $L$ = output length, $R$ = reference length; n-gram counting |
| ROUGE-N | $O(L \cdot R)$ | $O(R)$ | Similar to BLEU but recall-oriented |
| ROUGE-L (LCS) | $O(L \cdot R)$ | $O(L \cdot R)$ | Dynamic programming for longest common subsequence |
| METEOR | $O(L \cdot R \cdot S)$ | $O(L + R)$ | $S$ = synonym lookup cost per word |
| Perplexity | $O(N \cdot d^2)$ | $O(d)$ | $N$ = test tokens, requires full model forward pass |
| BERTScore | $O((L + R) \cdot d^2 + L \cdot R \cdot d)$ | $O((L + R) \cdot d)$ | BERT forward pass + pairwise cosine similarity matrix |
BLEU and ROUGE are cheap (milliseconds per sentence). BERTScore requires a BERT forward pass per sentence pair, making it roughly 100-1000x slower, but far better at crediting semantically equivalent paraphrases.
Implementation
ALGORITHM ComputeROUGE_N(candidate, reference, n)
INPUT: candidate: list of tokens, reference: list of tokens, n: integer (n-gram order)
OUTPUT: rougeN: float (recall score)
BEGIN
// Extract n-grams from both (missing map keys read as 0)
candNgrams <- empty map
FOR i FROM 0 TO LENGTH(candidate) - n DO
ngram <- TUPLE(candidate[i .. i+n-1])
candNgrams[ngram] <- candNgrams[ngram] + 1
END FOR
refNgrams <- empty map
FOR i FROM 0 TO LENGTH(reference) - n DO
ngram <- TUPLE(reference[i .. i+n-1])
refNgrams[ngram] <- refNgrams[ngram] + 1
END FOR
// Count matches (clipped by candidate count)
matchCount <- 0
refTotal <- 0
FOR EACH (ngram, refCount) IN refNgrams DO
candCount <- candNgrams[ngram] IF ngram IN candNgrams ELSE 0
matchCount <- matchCount + MIN(candCount, refCount)
refTotal <- refTotal + refCount
END FOR
IF refTotal = 0 THEN RETURN 0
rougeN <- matchCount / refTotal // recall
RETURN rougeN
END
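The ROUGE-N pseudocode translates directly to Python, with `collections.Counter` standing in for the count maps:

```python
from collections import Counter

def rouge_n(candidate, reference, n):
    """N-gram recall: fraction of reference n-grams found in the
    candidate, with matches clipped so repeats are not over-counted."""
    cand = Counter(tuple(candidate[i:i+n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i+n]) for i in range(len(reference) - n + 1))
    ref_total = sum(ref.values())
    if ref_total == 0:
        return 0.0
    matches = sum(min(cand[g], c) for g, c in ref.items())
    return matches / ref_total
```

For example, the candidate "the cat" against the reference "the cat sat" recovers one of the two reference bigrams, giving a ROUGE-2 of 0.5.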
ALGORITHM ComputeROUGE_L(candidate, reference)
INPUT: candidate: list of tokens, reference: list of tokens
OUTPUT: rougeL: float (F1 based on LCS)
BEGIN
m <- LENGTH(candidate)
n <- LENGTH(reference)
// LCS via dynamic programming
dp <- CREATE 2D array [(m+1) x (n+1)], all zeros
FOR i FROM 1 TO m DO
FOR j FROM 1 TO n DO
IF candidate[i-1] = reference[j-1] THEN
dp[i][j] <- dp[i-1][j-1] + 1
ELSE
dp[i][j] <- MAX(dp[i-1][j], dp[i][j-1])
END IF
END FOR
END FOR
lcsLen <- dp[m][n]
IF lcsLen = 0 THEN RETURN 0
precision <- lcsLen / m
recall <- lcsLen / n
f1 <- (2 * precision * recall) / (precision + recall)
RETURN f1
END
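Likewise, a direct Python transcription of the ROUGE-L pseudocode:

```python
def rouge_l(candidate, reference):
    """LCS-based F1 over token lists, mirroring the dynamic program above."""
    m, n = len(candidate), len(reference)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if candidate[i-1] == reference[j-1]:
                dp[i][j] = dp[i-1][j-1] + 1
            else:
                dp[i][j] = max(dp[i-1][j], dp[i][j-1])
    lcs = dp[m][n]
    if lcs == 0:
        return 0.0
    p, r = lcs / m, lcs / n
    return 2 * p * r / (p + r)
```

Because LCS does not require contiguity, ["a", "b", "c"] against ["a", "x", "c"] still shares a subsequence of length 2, giving precision and recall of 2/3 each.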
ALGORITHM ComputePerplexity(model, testTokens)
INPUT: model: language model, testTokens: list of N token indices
OUTPUT: perplexity: float
BEGIN
totalLogProb <- 0
FOR i FROM 1 TO N - 1 DO
logProbs <- model.LOG_PROBS(testTokens[0 .. i-1])
totalLogProb <- totalLogProb + logProbs[testTokens[i]]
END FOR
avgNegLogProb <- -totalLogProb / (N - 1)
perplexity <- EXP(avgNegLogProb) // using natural log; or 2^(H) with log base 2
RETURN perplexity
END
Real-World Applications
- Machine translation benchmarks: BLEU is the standard metric for WMT (Workshop on Machine Translation) competitions, despite known limitations
- Summarization evaluation: ROUGE-1, ROUGE-2, and ROUGE-L are the standard metrics for evaluating abstractive and extractive summarization systems
- Language model development: Perplexity on held-out data is the primary metric for comparing language model architectures and training recipes
- Chatbot evaluation: Production chatbots use a combination of BERTScore (semantic accuracy), human ratings (fluency, helpfulness), and task-specific metrics (resolution rate)
- LLM leaderboards: Benchmarks like MMLU, HumanEval, and MT-Bench combine automatic metrics with human preference ratings to rank models
- Content moderation: Automatic metrics flag generated content that deviates significantly from expected quality baselines, triggering human review
Key Takeaways
- BLEU measures n-gram precision (how much of the output is correct), ROUGE measures n-gram recall (how much of the reference is captured), and neither handles paraphrases well
- METEOR improves on BLEU by incorporating stemming, synonyms, and word order penalties, correlating better with human judgment
- Perplexity measures how well a language model predicts held-out text and is the standard metric for comparing language models, but it does not directly measure generation quality
- BERTScore uses neural embeddings to capture semantic similarity beyond exact string matching, bridging the gap between cheap n-gram metrics and expensive human evaluation
- No automatic metric fully replaces human evaluation: all metrics are proxies, and understanding their blind spots is essential for interpreting NLP research