Evaluation Metrics for NLP: Why Measuring Language Quality Is Hard

From BLEU and ROUGE to METEOR and perplexity, exploring the automatic metrics that score NLP systems, their blind spots, and why human evaluation remains the gold standard.

2025-10-18
Computational Linguistics · evaluation-metrics · bleu · rouge · nlp

Terminology

Term Definition
BLEU Bilingual Evaluation Understudy: measures n-gram precision between machine output and reference translations, with a brevity penalty
ROUGE Recall-Oriented Understudy for Gisting Evaluation: measures n-gram recall between generated summaries and reference summaries
METEOR Metric for Evaluation of Translation with Explicit ORdering: extends BLEU with stemming, synonyms, and word order penalties
Perplexity $2^{H}$ where $H$ is cross-entropy: measures how "surprised" a language model is by test data. Lower is better
Precision Fraction of generated n-grams that appear in the reference: how much of the output is correct
Recall Fraction of reference n-grams that appear in the output: how much of the reference is captured
F1 Score Harmonic mean of precision and recall: $F_1 = \frac{2 \cdot P \cdot R}{P + R}$, balancing both concerns
BERTScore A metric that computes token-level cosine similarity between BERT embeddings of the candidate and reference, capturing semantic similarity beyond exact n-gram matches
Human Evaluation Having human judges rate outputs on dimensions like fluency, adequacy, and coherence; the gold standard but expensive and slow

What & Why

Evaluating generated text is fundamentally harder than evaluating classification or regression. There is no single correct answer. "The cat is on the mat" and "A feline rests atop the rug" convey the same meaning with almost no word overlap (only "the" is shared). Any metric based on exact string matching will score the second translation near zero, which is clearly wrong.
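To see the problem concretely, here is a toy unigram-overlap precision (not any standard library's implementation) applied to that paraphrase pair:

```python
# Toy demonstration: exact-match unigram precision scores a perfect
# paraphrase almost as badly as unrelated text.
def unigram_precision(candidate, reference):
    """Fraction of candidate tokens that appear verbatim in the reference."""
    cand = candidate.lower().split()
    ref = set(reference.lower().split())
    if not cand:
        return 0.0
    return sum(1 for tok in cand if tok in ref) / len(cand)

reference = "The cat is on the mat"
paraphrase = "A feline rests atop the rug"

print(unigram_precision(reference, reference))   # 1.0 (identical)
print(unigram_precision(paraphrase, reference))  # near zero: only "the" matches
```

Only the function word "the" survives exact matching, so the paraphrase scores about 0.17 despite being a perfectly adequate translation.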

This is why NLP has developed a family of automatic metrics, each with different strengths and blind spots. BLEU measures precision (how much of the output matches the reference). ROUGE measures recall (how much of the reference is captured). METEOR adds stemming and synonym matching. BERTScore uses neural embeddings to capture semantic similarity. None of them fully correlate with human judgment, but together they provide useful signals.

Understanding these metrics matters because they drive model development. Researchers optimize for BLEU in translation, ROUGE in summarization, and perplexity in language modeling. If the metric is flawed, the optimization target is flawed, and the resulting models have predictable failure modes. Knowing what each metric measures (and what it misses) is essential for interpreting NLP research and building production systems.

How It Works

BLEU (Precision-Focused)

BLEU computes modified n-gram precision for n = 1, 2, 3, 4, takes their geometric mean, and applies a brevity penalty:

$\text{BLEU} = BP \cdot \exp\left(\sum_{n=1}^{4} \frac{1}{4} \log p_n\right)$

"Modified" precision clips each n-gram count by its maximum count in any reference, preventing a degenerate output like "the the the the" from scoring high against a reference containing "the."

ROUGE (Recall-Focused)

ROUGE-N measures n-gram recall: what fraction of reference n-grams appear in the output.

$\text{ROUGE-N} = \frac{\sum_{\text{ref}} \sum_{\text{n-gram} \in \text{ref}} \text{Count}_{\text{match}}(\text{n-gram})}{\sum_{\text{ref}} \sum_{\text{n-gram} \in \text{ref}} \text{Count}(\text{n-gram})}$

ROUGE-L uses the longest common subsequence (LCS) instead of fixed n-grams, capturing word order without requiring contiguous matches.

[Diagram] BLEU vs ROUGE: precision vs recall. BLEU (precision) asks "how much of the OUTPUT matches the reference?"; ROUGE (recall) asks "how much of the REFERENCE is captured in the output?". Both fail on paraphrases: reference "The cat sat on the mat" against output "A feline rested on the rug" yields low BLEU and low ROUGE.

METEOR

METEOR improves on BLEU by:

  1. Matching words via exact match, stemming ("running" matches "runs" through the shared stem "run"), and synonym lookup (WordNet)
  2. Computing an alignment between candidate and reference
  3. Penalizing fragmented matches (words matched but in wrong order)

METEOR correlates better with human judgment than BLEU, especially at the sentence level.
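The scoring skeleton of these three steps can be sketched in pure Python using exact unigram matches only (the real metric also matches stems and WordNet synonyms); the recall-weighted mean $F_{mean} = \frac{10PR}{R + 9P}$ and the fragmentation penalty $0.5 \cdot (\text{chunks}/\text{matches})^3$ follow the original formulation:

```python
# Toy METEOR-style score: exact unigram matching with a greedy
# left-to-right alignment. A sketch, not the official implementation.
def meteor_sketch(candidate, reference):
    ref_positions = {}
    for j, tok in enumerate(reference):
        ref_positions.setdefault(tok, []).append(j)

    # Greedy alignment: each candidate token takes the earliest unused
    # matching position in the reference.
    used = set()
    alignment = []  # (candidate index, reference index) pairs
    for i, tok in enumerate(candidate):
        for j in ref_positions.get(tok, []):
            if j not in used:
                used.add(j)
                alignment.append((i, j))
                break

    m = len(alignment)
    if m == 0:
        return 0.0

    precision = m / len(candidate)
    recall = m / len(reference)
    f_mean = 10 * precision * recall / (recall + 9 * precision)

    # A "chunk" is a maximal run of matches contiguous and in order in
    # both strings; fragmented alignments produce more chunks.
    chunks = 1
    for (i1, j1), (i2, j2) in zip(alignment, alignment[1:]):
        if i2 != i1 + 1 or j2 != j1 + 1:
            chunks += 1

    penalty = 0.5 * (chunks / m) ** 3
    return f_mean * (1 - penalty)

print(meteor_sketch("the cat sat".split(), "the cat sat".split()))  # near 1
print(meteor_sketch("sat cat the".split(), "the cat sat".split()))  # penalized
```

Note how the reordered candidate matches every word (precision = recall = 1) but forms three chunks instead of one, so the fragmentation penalty cuts its score in half.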

Perplexity

Perplexity measures how well a language model predicts held-out text:

$\text{PP} = 2^{-\frac{1}{N}\sum_{i=1}^{N}\log_2 P(w_i \mid w_{<i})}$

A perplexity of 20 means the model is, on average, as uncertain as choosing uniformly among 20 words. Lower is better. Perplexity is used to evaluate language models themselves, not their generated outputs.
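A quick sanity check of that interpretation: a toy "model" that assigns uniform probability $1/V$ to every next token has perplexity exactly $V$. A sketch using natural logs (so $e^{H}$ rather than $2^{H}$; the value is identical as long as the log base and the exponent base agree):

```python
# Perplexity from per-token log-probabilities; a uniform model over a
# vocabulary of V = 20 words should score exactly 20.
import math

def perplexity(log_probs):
    """log_probs: natural-log probability of each test token under the model."""
    n = len(log_probs)
    return math.exp(-sum(log_probs) / n)

V = 20
tokens = ["w"] * 100                       # any held-out sequence of 100 tokens
uniform = [math.log(1 / V) for _ in tokens]
print(perplexity(uniform))                 # 20.0: as uncertain as a 20-way choice
```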

BERTScore

BERTScore computes token-level cosine similarity between BERT embeddings of the candidate and reference tokens, greedily matching each token to its most similar counterpart on the other side, then aggregates the matches into precision (averaged over candidate tokens), recall (averaged over reference tokens), and F1. This captures semantic equivalence: "feline" and "cat" have high embedding similarity even though they share no characters.
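The greedy-matching core can be sketched with hand-picked 2-D vectors standing in for BERT's contextual embeddings (purely illustrative; the real metric runs a BERT forward pass to obtain the vectors):

```python
# Greedy-matching core of BERTScore's recall with toy 2-D "embeddings".
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def bertscore_recall(cand_vecs, ref_vecs):
    # For each reference token, take its best match among candidate
    # tokens, then average (precision swaps the roles; F1 combines them).
    return sum(max(cosine(r, c) for c in cand_vecs)
               for r in ref_vecs) / len(ref_vecs)

# Toy embeddings: "cat" and "feline" point in nearly the same direction.
emb = {"cat": [1.0, 0.1], "feline": [0.95, 0.15], "rock": [-0.2, 1.0]}
print(bertscore_recall([emb["feline"]], [emb["cat"]]))  # high: semantic match
print(bertscore_recall([emb["rock"]], [emb["cat"]]))    # low: unrelated
```

An exact-match metric scores "feline" vs "cat" at zero; the embedding similarity here is close to 1.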

Complexity Analysis

Metric Time Space Notes
BLEU $O(L \cdot R)$ $O(L)$ $L$ = output length, $R$ = reference length; n-gram counting
ROUGE-N $O(L \cdot R)$ $O(R)$ Similar to BLEU but recall-oriented
ROUGE-L (LCS) $O(L \cdot R)$ $O(L \cdot R)$ Dynamic programming for longest common subsequence
METEOR $O(L \cdot R \cdot S)$ $O(L + R)$ $S$ = synonym lookup cost per word
Perplexity $O(N \cdot d^2)$ $O(d)$ $N$ = test tokens, requires full model forward pass
BERTScore $O((L + R) \cdot d^2 + L \cdot R \cdot d)$ $O((L + R) \cdot d)$ BERT forward pass + pairwise cosine similarity matrix

BLEU and ROUGE are cheap (milliseconds per sentence). BERTScore requires a BERT forward pass per sentence pair, making it 100-1000x slower but significantly more accurate for semantic equivalence.

Implementation

ALGORITHM ComputeROUGE_N(candidate, reference, n)
INPUT: candidate: list of tokens, reference: list of tokens, n: integer (n-gram order)
OUTPUT: rougeN: float (recall score)

BEGIN
  // Extract n-grams from both
  candNgrams <- empty map
  FOR i FROM 0 TO LENGTH(candidate) - n DO
    ngram <- TUPLE(candidate[i .. i+n-1])
    candNgrams[ngram] <- candNgrams[ngram] + 1
  END FOR

  refNgrams <- empty map
  FOR i FROM 0 TO LENGTH(reference) - n DO
    ngram <- TUPLE(reference[i .. i+n-1])
    refNgrams[ngram] <- refNgrams[ngram] + 1
  END FOR

  // Count matches (clipped by candidate count)
  matchCount <- 0
  refTotal <- 0
  FOR EACH (ngram, refCount) IN refNgrams DO
    candCount <- candNgrams[ngram] IF ngram IN candNgrams ELSE 0
    matchCount <- matchCount + MIN(candCount, refCount)
    refTotal <- refTotal + refCount
  END FOR

  IF refTotal = 0 THEN RETURN 0

  rougeN <- matchCount / refTotal   // recall
  RETURN rougeN
END

ALGORITHM ComputeROUGE_L(candidate, reference)
INPUT: candidate: list of tokens, reference: list of tokens
OUTPUT: rougeL: float (F1 based on LCS)

BEGIN
  m <- LENGTH(candidate)
  n <- LENGTH(reference)

  // LCS via dynamic programming
  dp <- CREATE 2D array [(m+1) x (n+1)], all zeros
  FOR i FROM 1 TO m DO
    FOR j FROM 1 TO n DO
      IF candidate[i-1] = reference[j-1] THEN
        dp[i][j] <- dp[i-1][j-1] + 1
      ELSE
        dp[i][j] <- MAX(dp[i-1][j], dp[i][j-1])
      END IF
    END FOR
  END FOR

  lcsLen <- dp[m][n]

  IF lcsLen = 0 THEN RETURN 0

  precision <- lcsLen / m
  recall <- lcsLen / n
  f1 <- (2 * precision * recall) / (precision + recall)

  RETURN f1
END

ALGORITHM ComputePerplexity(model, testTokens)
INPUT: model: language model, testTokens: list of N token indices
OUTPUT: perplexity: float

BEGIN
  totalLogProb <- 0

  FOR i FROM 1 TO N - 1 DO
    logProbs <- model.LOG_PROBS(testTokens[0 .. i-1])
    totalLogProb <- totalLogProb + logProbs[testTokens[i]]
  END FOR

  avgNegLogProb <- -totalLogProb / (N - 1)
  perplexity <- EXP(avgNegLogProb)    // using natural log; or 2^(H) with log base 2

  RETURN perplexity
END
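For reference, the ROUGE-L pseudocode above translates almost line-for-line into Python:

```python
# ROUGE-L: LCS length via dynamic programming, then F1 over the
# LCS-based precision and recall, as in the pseudocode above.
def rouge_l(candidate, reference):
    m, n = len(candidate), len(reference)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if candidate[i - 1] == reference[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[m][n]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / m, lcs / n
    return 2 * precision * recall / (precision + recall)

ref = "the cat sat on the mat".split()
print(rouge_l(ref, ref))                    # 1.0
print(rouge_l("the cat on mat".split(), ref))  # 0.8: LCS = 4, P = 1, R = 2/3
```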

Real-World Applications

  • Machine translation benchmarks: BLEU is the standard metric for WMT (Workshop on Machine Translation) competitions, despite known limitations
  • Summarization evaluation: ROUGE-1, ROUGE-2, and ROUGE-L are the standard metrics for evaluating abstractive and extractive summarization systems
  • Language model development: Perplexity on held-out data is the primary metric for comparing language model architectures and training recipes
  • Chatbot evaluation: Production chatbots use a combination of BERTScore (semantic accuracy), human ratings (fluency, helpfulness), and task-specific metrics (resolution rate)
  • LLM leaderboards: Benchmarks like MMLU, HumanEval, and MT-Bench combine automatic metrics with human preference ratings to rank models
  • Content moderation: Automatic metrics flag generated content that deviates significantly from expected quality baselines, triggering human review

Key Takeaways

  • BLEU measures n-gram precision (how much of the output is correct), ROUGE measures n-gram recall (how much of the reference is captured), and neither handles paraphrases well
  • METEOR improves on BLEU by incorporating stemming, synonyms, and word order penalties, correlating better with human judgment
  • Perplexity measures how well a language model predicts held-out text and is the standard metric for comparing language models, but it does not directly measure generation quality
  • BERTScore uses neural embeddings to capture semantic similarity beyond exact string matching, bridging the gap between cheap n-gram metrics and expensive human evaluation
  • No automatic metric fully replaces human evaluation: all metrics are proxies, and understanding their blind spots is essential for interpreting NLP research