Evaluation Metrics for NLP: Why Measuring Language Quality Is Hard
From BLEU and ROUGE to METEOR and perplexity, exploring the automatic metrics that score NLP systems, their blind spots, and why human evaluation remains the gold standard.
What & Why
Evaluating generated text is fundamentally harder than evaluating classification or regression. There is no single correct answer. "The cat is on the mat" and "A feline rests atop the rug" convey the same meaning with zero word overlap. Any metric based on exact string matching will give the second sentence a score of zero, which is clearly wrong.
This is why NLP has developed a family of automatic metrics, each with different strengths and blind spots. BLEU measures precision (how much of the output matches the reference). ROUGE measures recall (how much of the reference is captured). METEOR adds stemming and synonym matching. BERTScore uses neural embeddings to capture semantic similarity. None of them fully correlate with human judgment, but together they provide useful signals.
Understanding these metrics matters because they drive model development. Researchers optimize for BLEU in translation, ROUGE in summarization, and perplexity in language modeling. If the metric is flawed, the optimization target is flawed, and the resulting models have predictable failure modes. Knowing what each metric measures (and what it misses) is essential for interpreting NLP research and building production systems.
How It Works
BLEU (Precision-Focused)
BLEU computes modified n-gram precision for $n = 1, 2, 3, 4$, takes their geometric mean, and applies a brevity penalty:

$$\text{BLEU} = \text{BP} \cdot \exp\left(\sum_{n=1}^{4} w_n \log p_n\right), \qquad \text{BP} = \min\left(1,\; e^{1 - r/c}\right)$$

where $p_n$ is the modified $n$-gram precision, $w_n = 1/4$ are uniform weights, $c$ is the candidate length, and $r$ is the reference length.
"Modified" precision clips each n-gram count by its maximum count in any reference, preventing a degenerate output like "the the the the" from scoring high against a reference containing "the."
ROUGE (Recall-Focused)
ROUGE-N measures n-gram recall: what fraction of reference n-grams appear in the output.
ROUGE-L uses the longest common subsequence (LCS) instead of fixed n-grams, capturing word order without requiring contiguous matches.
METEOR
METEOR improves on BLEU by:
- Matching words via exact match, stemming ("running" matches "runs"), and synonym lookup (WordNet)
- Computing an alignment between candidate and reference
- Penalizing fragmented matches (words matched but in wrong order)
METEOR correlates better with human judgment than BLEU, especially at the sentence level.
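The alignment-plus-fragmentation idea can be sketched as follows. This is a simplified illustration, not the official metric: `simple_stem` is a toy suffix-stripper standing in for a real stemmer, there is no WordNet synonym stage, and the parameter values (`alpha`, `beta`, `gamma`) follow the commonly cited METEOR defaults:

```python
def simple_stem(word):
    # toy stemmer: strip a few common suffixes (stands in for Porter stemming)
    for suf in ("ing", "ed", "s"):
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[:-len(suf)]
    return word

def meteor_sketch(candidate, reference, alpha=0.9, beta=3.0, gamma=0.5):
    # stage 1: exact matches; stage 2: stem matches on leftover tokens
    matches, used_ref = [], set()
    for stage in (lambda w: w, simple_stem):
        for i, cw in enumerate(candidate):
            if any(ci == i for ci, _ in matches):
                continue  # candidate token already aligned
            for j, rw in enumerate(reference):
                if j not in used_ref and stage(cw) == stage(rw):
                    matches.append((i, j))
                    used_ref.add(j)
                    break
    m = len(matches)
    if m == 0:
        return 0.0
    precision, recall = m / len(candidate), m / len(reference)
    # recall-weighted harmonic mean (alpha = 0.9 favors recall)
    f_mean = precision * recall / (alpha * precision + (1 - alpha) * recall)
    # count chunks: maximal runs of matches contiguous in both sentences
    matches.sort()
    chunks = 1 + sum(1 for (ci, rj), (ni, nj) in zip(matches, matches[1:])
                     if not (ni == ci + 1 and nj == rj + 1))
    penalty = gamma * (chunks / m) ** beta
    return f_mean * (1 - penalty)
```

The fragmentation penalty is what distinguishes METEOR from a plain F-score: a candidate whose matched words appear in scrambled order produces many chunks and is penalized accordingly.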
Perplexity
Perplexity measures how well a language model predicts held-out text:

$$\text{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log p(w_i \mid w_1, \ldots, w_{i-1})\right)$$
A perplexity of 20 means the model is, on average, as uncertain as choosing uniformly among 20 words. Lower is better. Perplexity is used to evaluate language models themselves, not their generated outputs.
BERTScore
BERTScore computes token-level cosine similarity between BERT embeddings of the candidate and reference tokens, then takes the maximum similarity for each token (greedy matching). This captures semantic equivalence: "feline" and "cat" have high embedding similarity even though they share no characters.
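The greedy-matching step can be demonstrated without running BERT at all. In this sketch, a small table of hand-made static vectors (`EMB`) stands in for contextual BERT embeddings, and `bertscore_recall` shows the recall direction of the metric (each reference token takes its best cosine match in the candidate):

```python
import math

# toy static vectors standing in for contextual BERT embeddings;
# real BERTScore embeds each token in context with a BERT forward pass
EMB = {
    "cat":    [0.90, 0.10, 0.00],
    "feline": [0.85, 0.15, 0.05],
    "mat":    [0.10, 0.90, 0.20],
    "rug":    [0.15, 0.85, 0.25],
    "the":    [0.00, 0.00, 1.00],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def bertscore_recall(candidate, reference):
    # each reference token greedily takes its best match in the candidate
    sims = [max(cosine(EMB[rw], EMB[cw]) for cw in candidate)
            for rw in reference]
    return sum(sims) / len(sims)
```

Because "feline" and "cat" sit close together in embedding space, `bertscore_recall(["the", "feline"], ["the", "cat"])` scores near 1.0, while substituting an unrelated word scores noticeably lower, even though both candidates have identical string overlap with the reference.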
Complexity Analysis
BLEU and ROUGE are cheap (milliseconds per sentence). BERTScore requires a BERT forward pass per sentence pair, making it 100-1000x slower but significantly more accurate for semantic equivalence.
Implementation
ALGORITHM ComputeROUGE_N(candidate, reference, n)
INPUT: candidate: list of tokens, reference: list of tokens, n: integer (n-gram order)
OUTPUT: rougeN: float (recall score)
BEGIN
// Extract n-grams from both
candNgrams <- empty map
FOR i FROM 0 TO LENGTH(candidate) - n DO
ngram <- TUPLE(candidate[i .. i+n-1])
candNgrams[ngram] <- candNgrams[ngram] + 1
END FOR
refNgrams <- empty map
FOR i FROM 0 TO LENGTH(reference) - n DO
ngram <- TUPLE(reference[i .. i+n-1])
refNgrams[ngram] <- refNgrams[ngram] + 1
END FOR
// Count matches (clipped by candidate count)
matchCount <- 0
refTotal <- 0
FOR EACH (ngram, refCount) IN refNgrams DO
candCount <- candNgrams[ngram] IF ngram IN candNgrams ELSE 0
matchCount <- matchCount + MIN(candCount, refCount)
refTotal <- refTotal + refCount
END FOR
IF refTotal = 0 THEN RETURN 0
rougeN <- matchCount / refTotal // recall
RETURN rougeN
END
ALGORITHM ComputeROUGE_L(candidate, reference)
INPUT: candidate: list of tokens, reference: list of tokens
OUTPUT: rougeL: float (F1 based on LCS)
BEGIN
m <- LENGTH(candidate)
n <- LENGTH(reference)
// LCS via dynamic programming
dp <- CREATE 2D array [(m+1) x (n+1)], all zeros
FOR i FROM 1 TO m DO
FOR j FROM 1 TO n DO
IF candidate[i-1] = reference[j-1] THEN
dp[i][j] <- dp[i-1][j-1] + 1
ELSE
dp[i][j] <- MAX(dp[i-1][j], dp[i][j-1])
END IF
END FOR
END FOR
lcsLen <- dp[m][n]
IF lcsLen = 0 THEN RETURN 0
precision <- lcsLen / m
recall <- lcsLen / n
f1 <- (2 * precision * recall) / (precision + recall)
RETURN f1
END
ALGORITHM ComputePerplexity(model, testTokens)
INPUT: model: language model, testTokens: list of N token indices
OUTPUT: perplexity: float
BEGIN
totalLogProb <- 0
FOR i FROM 1 TO N - 1 DO
logProbs <- model.LOG_PROBS(testTokens[0 .. i-1])
totalLogProb <- totalLogProb + logProbs[testTokens[i]]
END FOR
avgNegLogProb <- -totalLogProb / (N - 1)
perplexity <- EXP(avgNegLogProb) // using natural log; or 2^(H) with log base 2
RETURN perplexity
END
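The perplexity pseudocode above translates almost line for line into Python. A useful sanity check, under the assumption that `log_prob_fn` returns natural-log probabilities for every vocabulary word given the preceding context: a model that is uniform over a 20-word vocabulary should come out at exactly perplexity 20, matching the "as uncertain as choosing uniformly among 20 words" interpretation:

```python
import math

def perplexity(log_prob_fn, tokens):
    """exp of the average negative log-likelihood of tokens[1:].

    log_prob_fn(context) -> dict mapping each vocabulary token to its
    natural-log probability given the context (an assumed interface)."""
    total_log_prob = 0.0
    for i in range(1, len(tokens)):
        total_log_prob += log_prob_fn(tokens[:i])[tokens[i]]
    return math.exp(-total_log_prob / (len(tokens) - 1))

# sanity check with a uniform model over a 20-word vocabulary
vocab = [f"w{i}" for i in range(20)]
uniform = lambda context: {w: math.log(1 / 20) for w in vocab}
ppl = perplexity(uniform, ["w0", "w3", "w7", "w1"])
# a uniform model over 20 words yields perplexity 20
```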
Real-World Applications
Key Takeaways