NLP metrics evaluate tasks from machine translation and summarization to question answering and sequence labeling. Most boil down to n-gram overlap, embedding similarity, or classification-style precision/recall tailored to the output type.
When to use which metric
| Metric | When to use |
|---|---|
| BLEU | Translation / generation; n-gram precision. |
| ROUGE-N / L / S | Summarization; n-gram, LCS, or skip-bigram recall. |
| METEOR | Translation with synonym and stemming awareness. |
| chrF | Character-level n-gram F-score; robust for morphologically rich languages. |
| BERTScore | Semantic similarity via BERT embeddings. |
| MoverScore | Semantic distance via Earth Mover’s Distance on contextualized embeddings. |
| Perplexity | Language modeling — how well a model predicts a sequence. |
| Bits per Character (BPC) | Same idea as perplexity, character-level. |
| Exact Match (EM) | QA — does the predicted answer match exactly? |
| QA F1 | QA — word-level bag-of-words F1. |
| Answer Accuracy | Multiple-choice QA. |
| MRR | Rank of the first correct answer (QA / IR). |
| Entity-level F1 | NER at the entity level. |
| Span-based F1 | NER with exact span match. |
| CoNLL Score | NER averaged across entity types. |
Machine Translation & Text Generation Metrics
Used for translation, summarization, image captioning, and dialogue.
BLEU (Bilingual Evaluation Understudy)
Measures the modified n-gram precision of a candidate translation against one or more reference translations. It correlates moderately with human judgment but captures fluency and meaning poorly. Because it is precision-based, it would favor short outputs on its own, which is what the brevity penalty counteracts.
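$$\text{BLEU} = \text{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$$
Where:
- $p_n$ is the modified n-gram precision.
- $w_n$ is the weight for n-gram precision (typically uniform weights, $w_n = 1/N$).
- BP is the brevity penalty to penalize short translations: $\text{BP} = 1$ if the candidate length $c$ exceeds the reference length $r$, otherwise $\text{BP} = e^{1 - r/c}$.
- $N$ is the maximum n-gram size (typically 4).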
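As a rough illustration of the definition above (brevity penalty times the weighted geometric mean of modified n-gram precisions), here is a minimal sentence-level sketch. The function names `ngrams` and `sentence_bleu` are hypothetical, the input is assumed to be pre-tokenized, and no smoothing is applied, so this is not a substitute for a standard implementation.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, references, max_n=4):
    """Unsmoothed sentence-level BLEU with uniform weights w_n = 1 / max_n."""
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        # Clip each candidate n-gram count by its maximum count in any reference.
        max_ref_counts = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref_counts[gram] = max(max_ref_counts[gram], count)
        clipped = sum(min(count, max_ref_counts[gram])
                      for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if clipped == 0 or total == 0:
            return 0.0  # without smoothing, one zero precision zeroes the score
        log_precision_sum += math.log(clipped / total) / max_n

    # Brevity penalty: compare candidate length c with the closest reference length r.
    c = len(candidate)
    r = min((len(ref) for ref in references),
            key=lambda length: (abs(length - c), length))
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(log_precision_sum)
```

The clipping step is what makes the precision "modified": a candidate that repeats "the" four times is credited only as many times as "the" occurs in a reference. Real evaluations are usually corpus-level and smoothed, so scores from this sketch will not match reported BLEU numbers.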
METEOR (Metric for Evaluation of Translation with Explicit ORdering)
F-score based on word matches, considering synonyms, stemming, and word order.
$$\text{METEOR} = F_{\text{mean}} \cdot (1 - \text{Penalty})$$
Where:
- $F_{\text{mean}}$ is a weighted harmonic mean of precision and recall.
- Penalty accounts for fragmentation (poor word order).
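To make the roles of the harmonic mean and the fragmentation penalty concrete, the sketch below computes a METEOR-style score from alignment statistics that are assumed to be already available. The function name `meteor_from_counts` is hypothetical, and the constants (recall weighted 9:1, penalty of 0.5 times the cubed chunk ratio) are those of the original 2005 formulation; later METEOR versions tune these parameters. The alignment step itself (exact, stem, and synonym matching) is not shown.

```python
def meteor_from_counts(matches, chunks, candidate_len, reference_len):
    """METEOR-style score from alignment statistics (original 2005 constants).

    matches: number of aligned unigrams between candidate and reference
    chunks:  number of contiguous, in-order runs the matches fall into
    """
    if matches == 0:
        return 0.0
    precision = matches / candidate_len
    recall = matches / reference_len
    # Recall-weighted harmonic mean of precision and recall.
    f_mean = 10 * precision * recall / (recall + 9 * precision)
    # Fragmentation penalty: fewer, longer chunks mean better word order.
    penalty = 0.5 * (chunks / matches) ** 3
    return f_mean * (1 - penalty)
```

With the same number of matches, a candidate whose matched words form one contiguous chunk scores higher than one whose matches are scattered across many chunks, which is how word order enters the metric.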
ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
Measures n-gram recall (and sometimes precision / F1) between generated and reference texts.
- ROUGE-N — n-gram overlap.
- ROUGE-L — longest common subsequence (LCS). Captures sentence-level structure similarity.
- ROUGE-S — skip-bigram overlap (pairs of words in sentence order with gaps allowed). ROUGE-SU extends ROUGE-S by adding unigrams.
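The two most common variants can be sketched in a few lines. This is a minimal recall-only illustration on pre-tokenized input with a single reference; the function names are hypothetical, and stemming, stopword handling, and the F-measure variants offered by standard ROUGE tooling are left out.

```python
from collections import Counter


def rouge_n_recall(candidate, reference, n=1):
    """ROUGE-N recall: overlapping n-grams divided by reference n-grams."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    overlap = sum((cand & ref).values())  # multiset intersection clips counts
    total = sum(ref.values())
    return overlap / total if total else 0.0


def lcs_length(a, b):
    """Length of the longest common subsequence via dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_recall(candidate, reference):
    """ROUGE-L recall: LCS length divided by reference length."""
    return lcs_length(candidate, reference) / len(reference) if reference else 0.0


candidate = "the cat sat on the mat".split()
reference = "the cat is on the mat".split()
print(rouge_n_recall(candidate, reference, n=1))  # 5 of 6 reference unigrams
print(rouge_n_recall(candidate, reference, n=2))  # 3 of 5 reference bigrams
print(rouge_l_recall(candidate, reference))       # LCS "the cat on the mat"
```

Unlike ROUGE-N, the LCS behind ROUGE-L does not require matched words to be adjacent, only to appear in the same order.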
chrF
Character n-gram F-score.
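$$\text{chrF}_\beta = (1 + \beta^2) \cdot \frac{\text{chrP} \cdot \text{chrR}}{\beta^2 \cdot \text{chrP} + \text{chrR}}$$
Where:
- chrP is character n-gram precision.
- chrR is character n-gram recall.
- $\beta$ determines recall importance (typically $\beta = 2$).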
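A simplified sketch of the computation follows. It assumes precision and recall are averaged over character n-gram orders 1 to `max_n` before being combined, that spaces are ignored, and that `beta=2.0` and `max_n=6` are reasonable defaults; the names `char_ngrams` and `chrf` are hypothetical. Reference implementations differ in details such as whitespace handling and optional word n-grams, so treat this as an illustration only.

```python
from collections import Counter


def char_ngrams(text, n):
    """Counter of character n-grams, ignoring spaces (an assumption of this sketch)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def chrf(candidate, reference, max_n=6, beta=2.0):
    """Character n-gram F-beta score, averaging P and R over orders 1..max_n."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        cand, ref = char_ngrams(candidate, n), char_ngrams(reference, n)
        if not cand or not ref:
            continue  # skip orders longer than either string
        overlap = sum((cand & ref).values())
        precisions.append(overlap / sum(cand.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    chr_p = sum(precisions) / len(precisions)
    chr_r = sum(recalls) / len(recalls)
    if chr_p + chr_r == 0:
        return 0.0
    return (1 + beta ** 2) * chr_p * chr_r / (beta ** 2 * chr_p + chr_r)
```

Because matching happens on characters, an inflected form that shares a stem with the reference word still earns partial credit, which is why the metric suits morphologically rich languages.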
BERTScore
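Uses contextual BERT embeddings to compare candidate and reference texts at the token level: each token is greedily matched to its most similar token on the other side by cosine similarity, and the matches are aggregated into precision, recall, and F1.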
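The token-level matching can be illustrated without a model by assuming the embeddings are already computed. In the sketch below, `cand_vecs` and `ref_vecs` are hypothetical lists of per-token vectors (in practice they would come from a pretrained contextual encoder), and the optional idf weighting and baseline rescaling of the real metric are omitted.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def greedy_match_score(cand_vecs, ref_vecs):
    """Return (precision, recall, f1) from greedy cosine matching of token vectors."""
    # Recall: each reference token takes its best match among candidate tokens.
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    # Precision: each candidate token takes its best match among reference tokens.
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

With toy two-dimensional vectors, a candidate covering only one of two orthogonal reference tokens gets full precision but half recall, mirroring how the metric penalizes missing content.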
MoverScore
Uses contextualized embeddings and Earth Mover’s Distance to measure semantic distance between generated and reference texts.
Perplexity
Measures how well a model predicts a sample. It is the exponentiated average negative log-likelihood of a sequence, so lower perplexity indicates better prediction.
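$$\text{PPL} = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log P(w_i \mid w_1, \ldots, w_{i-1})\right)$$
Where:
- $P(w_i \mid w_1, \ldots, w_{i-1})$ is the conditional probability of word $w_i$ given previous words.
- $N$ is the number of words.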
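Given the per-token conditional probabilities a model assigned to a sequence, perplexity is a short computation. The sketch assumes those probabilities are already available as a list of positive floats; the function name `perplexity` is hypothetical.

```python
import math


def perplexity(token_probs):
    """Perplexity from per-token conditional probabilities P(w_i | previous words)."""
    n = len(token_probs)
    avg_nll = -sum(math.log(p) for p in token_probs) / n  # average negative log-likelihood
    return math.exp(avg_nll)


# A model that assigns probability 0.25 to every token is as uncertain as a
# uniform choice among four options.
print(perplexity([0.25, 0.25, 0.25, 0.25]))
```

Perplexity is the inverse of the geometric mean of the token probabilities, so a single very unlikely token can raise it sharply. Values are only comparable between models that share the same tokenization.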
Bits Per Character (BPC)
Similar to perplexity but measured at the character level.
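$$\text{BPC} = -\frac{1}{T} \sum_{t=1}^{T} \log_2 P(c_t \mid c_1, \ldots, c_{t-1})$$
Where:
- $P(c_t \mid c_1, \ldots, c_{t-1})$ is the conditional probability of character $c_t$ given previous characters.
- $T$ is the total number of characters.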
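The character-level analogue differs from perplexity only in using base-2 logarithms and skipping the final exponentiation. The sketch assumes a list of per-character probabilities from some model; `bits_per_character` is a hypothetical name.

```python
import math


def bits_per_character(char_probs):
    """Average negative log2-probability per character."""
    return -sum(math.log2(p) for p in char_probs) / len(char_probs)


bpc = bits_per_character([0.25, 0.5])
print(bpc)       # (2 bits + 1 bit) / 2 characters
print(2 ** bpc)  # the equivalent character-level perplexity
```

Raising 2 to the BPC recovers the character-level perplexity, which is the sense in which the two metrics express the same quantity on different scales.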
Question Answering Metrics
Exact Match (EM)
Binary measure indicating whether the predicted answer exactly matches the ground truth. Over a dataset, it is reported as the percentage of predictions that exactly match any one of the ground-truth answers.
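In practice "exact" usually means exact after light normalization. The sketch below assumes a SQuAD-style normalization (lowercasing, stripping punctuation and English articles, collapsing whitespace); the function names are hypothetical and the normalization rules are an assumption that varies by benchmark.

```python
import re
import string


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold_answers):
    """1 if the normalized prediction equals any normalized gold answer, else 0."""
    target = normalize_answer(prediction)
    return int(any(target == normalize_answer(gold) for gold in gold_answers))


def exact_match_percentage(predictions, gold_answer_lists):
    """Percentage of predictions that exactly match one of their gold answers."""
    hits = sum(exact_match(p, golds) for p, golds in zip(predictions, gold_answer_lists))
    return 100.0 * hits / len(predictions)
```

Under this normalization "The Eiffel Tower." and "Eiffel Tower" count as a match, while any difference in content words does not.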
F1 Score
Word-level F1 between prediction and ground truth, treating both as bags of words.
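A minimal sketch of the bag-of-words computation, assuming whitespace tokenization and lowercasing only (benchmarks typically apply further answer normalization first); `qa_f1` is a hypothetical name.

```python
from collections import Counter


def qa_f1(prediction, gold):
    """Token-level F1 between two strings treated as bags of words."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    # Multiset intersection counts each shared token at most min(count) times.
    num_same = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(qa_f1("in Paris France", "Paris"))  # precision 1/3, recall 1
```

Unlike exact match, this gives partial credit to answers that contain the right words along with extra ones. When several gold answers exist, a common convention is to take the maximum F1 over them.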
Answer Accuracy
For multiple-choice QA, the proportion of questions answered correctly.
Mean Reciprocal Rank (MRR)
For QA models that return a ranked list of answers, the average over all questions of the reciprocal rank of the first correct answer. A question with no correct answer in its list contributes 0.
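The computation can be sketched as follows, assuming each question comes with a ranked list of candidate answers and a set of acceptable answers; the names are hypothetical.

```python
def mean_reciprocal_rank(ranked_lists, correct_sets):
    """Average of 1 / rank of the first correct answer; 0 when none is found."""
    total = 0.0
    for ranked, correct in zip(ranked_lists, correct_sets):
        for rank, answer in enumerate(ranked, start=1):
            if answer in correct:
                total += 1.0 / rank
                break  # only the first correct answer counts
    return total / len(ranked_lists)


ranked = [["a", "b", "c"], ["x", "y", "z"], ["p", "q"]]
correct = [{"a"}, {"z"}, {"missing"}]
print(mean_reciprocal_rank(ranked, correct))  # (1 + 1/3 + 0) / 3
```

Because only the first hit matters, MRR rewards putting one correct answer near the top and ignores any further correct answers lower in the list.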
Sequence Labeling Metrics
Sequence labeling tasks such as Named Entity Recognition (NER) and Part-of-Speech (POS) tagging assign a label to each token in order to identify and classify named entities or parts of speech in text.
Entity-Level F1 Score
F1 score calculated at the entity level rather than the token level.
Span-Based F1 Score
F1 score based on the exact match of entity spans.
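Exact-span scoring can be sketched by first converting tag sequences into entity spans and then comparing the two sets. The sketch assumes BIO tagging of a single sentence and treats a stray `I-` tag as the start of a new entity; the function names and that lenient handling are assumptions, and established evaluation scripts may resolve malformed sequences differently.

```python
def extract_spans(tags):
    """Convert a BIO tag sequence into a set of (type, start, end) spans, end exclusive."""
    spans, start, label = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel flushes the last span
        continues = tag.startswith("I-") and label == tag[2:]
        if continues:
            continue
        if label is not None:
            spans.add((label, start, i))  # close the open span
        if tag.startswith(("B-", "I-")):
            start, label = i, tag[2:]  # a stray I- tag opens a new span (lenient)
        else:
            start, label = None, None
    return spans


def span_f1(gold_tags, pred_tags):
    """Precision, recall, and F1 where a span counts only if type and boundaries match."""
    gold, pred = extract_spans(gold_tags), extract_spans(pred_tags)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


gold = ["B-PER", "I-PER", "O", "B-LOC"]
pred = ["B-PER", "O", "O", "B-LOC"]
print(span_f1(gold, pred))  # the truncated PER span earns no credit
```

In the example, three of four tokens are tagged correctly, yet only one of two entities matches exactly, which is the gap between token-level and entity-level scoring.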
Partial Matching Metrics
- Partial Precision / Recall — give credit for partial overlap between predicted and true entities.
- Type-Based Evaluation — separate evaluation for entity type classification and entity boundary detection.
CoNLL Score
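F1 aggregated across all entity types under exact span matching. The CoNLL-2003 evaluation script (conlleval) reports per-type scores together with an overall micro-averaged F1, which pools counts across types; a macro average of the per-type F1 scores weights every type equally instead.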
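Since "average across entity types" can be read in more than one way, the sketch below computes per-type F1 along with both a macro average (mean of the per-type scores) and a micro average (pooled counts). It assumes entities are given as sets of `(type, start, end)` tuples; the function name is hypothetical, and which aggregate a given report uses should be checked against its evaluation script.

```python
from collections import Counter


def per_type_f1(gold_spans, pred_spans):
    """Return ({type: f1}, macro_f1, micro_f1) from sets of (type, start, end) spans."""
    tp = Counter(t for t, _, _ in gold_spans & pred_spans)
    n_gold = Counter(t for t, _, _ in gold_spans)
    n_pred = Counter(t for t, _, _ in pred_spans)

    def f1(correct, predicted, actual):
        if correct == 0:
            return 0.0
        p, r = correct / predicted, correct / actual
        return 2 * p * r / (p + r)

    types = sorted(set(n_gold) | set(n_pred))
    scores = {t: f1(tp[t], n_pred[t], n_gold[t]) for t in types}
    macro = sum(scores.values()) / len(scores) if scores else 0.0
    micro = f1(sum(tp.values()), len(pred_spans), len(gold_spans))
    return scores, macro, micro


gold = {("PER", 0, 2), ("PER", 5, 6), ("PER", 9, 10), ("LOC", 3, 4)}
pred = {("PER", 0, 2), ("PER", 5, 6), ("PER", 9, 10), ("LOC", 7, 8)}
scores, macro, micro = per_type_f1(gold, pred)
print(scores, macro, micro)  # PER is perfect, LOC is missed entirely
```

Here the macro average is 0.5 while the micro average is 0.75, because the frequent PER type dominates the pooled counts. The two aggregates diverge most when entity types are imbalanced.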