NLP metrics evaluate tasks from machine translation and summarization to question answering and sequence labeling. Most boil down to n-gram overlap, embedding similarity, or classification-style precision/recall tailored to the output type.

When to use which metric

| Metric | When to use |
| --- | --- |
| BLEU | Translation / generation; n-gram precision. |
| ROUGE-N / L / S | Summarization; n-gram, LCS, or skip-bigram recall. |
| METEOR | Translation with synonym and stemming awareness. |
| chrF | Character-level n-gram F-score; robust for morphologically rich languages. |
| BERTScore | Semantic similarity via BERT embeddings. |
| MoverScore | Semantic distance via Earth Mover’s Distance on contextualized embeddings. |
| Perplexity | Language modeling: how well a model predicts a sequence. |
| Bits per Character (BPC) | Same idea as perplexity, character-level. |
| Exact Match (EM) | QA: does the predicted answer match exactly? |
| QA F1 | QA: word-level bag-of-words F1. |
| Answer Accuracy | Multiple-choice QA. |
| MRR | Rank of the first correct answer (QA / IR). |
| Entity-level F1 | NER at the entity level. |
| Span-based F1 | NER with exact span match. |
| CoNLL Score | NER; F1 aggregated across entity types. |

Machine Translation & Text Generation Metrics

Used for translation, summarization, image captioning, and dialogue.

BLEU (Bilingual Evaluation Understudy)

Measures the precision of n-gram matches between a candidate translation and one or more reference translations. Correlates moderately with human judgment but captures fluency and semantic meaning poorly. Because precision alone rewards short outputs, the score includes a brevity penalty.

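$$\text{BLEU} = \text{BP} \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right)$$

Where:

  • $p_n$ is the modified n-gram precision.
  • $w_n$ is the weight for n-gram precision (typically uniform weights, $w_n = 1/N$).
  • BP is the brevity penalty to penalize short translations: $\text{BP} = 1$ if $c > r$, otherwise $\exp(1 - r/c)$, where $c$ is the candidate length and $r$ is the reference length.
  • $N$ is the maximum n-gram size (typically 4).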
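To make the pieces of the formula concrete, here is a minimal sketch of unsmoothed sentence-level BLEU with uniform weights. The function names (`ngrams`, `sentence_bleu`) and the whitespace-tokenized inputs are assumptions made for illustration; production toolkits add smoothing and corpus-level aggregation that this sketch leaves out.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, references, max_n=4):
    """Unsmoothed sentence-level BLEU with uniform weights w_n = 1 / max_n."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        # Modified precision: clip each candidate n-gram count by the
        # maximum count of that n-gram in any single reference.
        max_ref_counts = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref_counts[gram] = max(max_ref_counts[gram], count)
        clipped = sum(min(c, max_ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if clipped == 0 or total == 0:
            return 0.0  # a zero precision makes the geometric mean zero
        log_precisions.append(math.log(clipped / total))

    c = len(candidate)
    # Effective reference length: the reference length closest to c.
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(sum(log_precisions) / max_n)
```

With `max_n=1`, the candidate "the the the the" against the reference "the cat sat on the mat" gets a clipped precision of 2/4 and a brevity penalty of exp(1 - 6/4), which shows how clipping and BP together stop a short, repetitive output from scoring well.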

METEOR (Metric for Evaluation of Translation with Explicit ORdering)

F-score based on word matches, considering synonyms, stemming, and word order.

$$\text{METEOR} = F_{\text{mean}} \cdot (1 - \text{Penalty})$$

Where:

  • $F_{\text{mean}}$ is a weighted harmonic mean of precision and recall.
  • Penalty accounts for fragmentation (poor word order).
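The sketch below shows how the score combines the harmonic mean with the fragmentation penalty once a word alignment is available. It assumes the parameters of the original 2005 METEOR formulation (recall weighted 9:1 in the harmonic mean, penalty of 0.5 times the cubed chunk-to-match ratio); later METEOR versions tune these values. The alignment itself (exact, stem, and synonym matching) is not implemented here, and the names `count_chunks` and `meteor_score` are hypothetical.

```python
def count_chunks(alignment):
    """Count runs of matches that are contiguous in both sentences.

    `alignment` is a list of (candidate_index, reference_index) pairs,
    sorted by candidate_index.
    """
    chunks = 0
    prev = None
    for cand_i, ref_i in alignment:
        # A new chunk starts whenever the match is not adjacent to the
        # previous one in both the candidate and the reference.
        if prev is None or cand_i != prev[0] + 1 or ref_i != prev[1] + 1:
            chunks += 1
        prev = (cand_i, ref_i)
    return chunks


def meteor_score(matches, cand_len, ref_len, chunks):
    """METEOR from counts, assuming the original 2005 parameter values."""
    if matches == 0:
        return 0.0
    precision = matches / cand_len
    recall = matches / ref_len
    f_mean = 10 * precision * recall / (recall + 9 * precision)
    penalty = 0.5 * (chunks / matches) ** 3
    return f_mean * (1 - penalty)
```

A perfectly ordered match forms a single chunk and receives a very small penalty, while the same words scattered across many chunks push the penalty toward its maximum of 0.5.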

ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

Measures n-gram recall (and sometimes precision / F1) between generated and reference texts.

  • ROUGE-N — n-gram overlap.
  • ROUGE-L — longest common subsequence (LCS). Captures sentence-level structure similarity.
  • ROUGE-S — skip-bigram overlap (pairs of words in sentence order with gaps allowed). ROUGE-SU extends ROUGE-S by adding unigrams.
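As an illustration of the first two variants, the following sketch computes ROUGE-N recall and a ROUGE-L score over token lists. It assumes a single reference and reports ROUGE-L as a plain F1 over LCS-based precision and recall; official implementations expose a weighted F-measure and handle multiple references and stemming. The function names are chosen for this example only.

```python
from collections import Counter


def rouge_n_recall(candidate, reference, n=1):
    """Fraction of reference n-grams that also appear in the candidate."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    total = sum(ref.values())
    return overlap / total if total else 0.0


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """F1 over LCS-based precision and recall (unweighted; an assumption)."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return 2 * precision * recall / (precision + recall)
```

For the candidate "the cat was found under the bed" and the reference "the cat was under the bed", ROUGE-1 recall is 1.0, ROUGE-2 recall is 0.8, and the LCS covers all six reference tokens.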

chrF

Character n-gram F-score.

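$$\text{chrF}_\beta = (1 + \beta^2) \cdot \frac{\text{chrP} \cdot \text{chrR}}{\beta^2 \cdot \text{chrP} + \text{chrR}}$$

Where:

  • chrP is character n-gram precision.
  • chrR is character n-gram recall.
  • $\beta$ determines recall importance (typically $\beta = 2$).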
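A minimal sketch of the computation follows. It assumes precision and recall are averaged uniformly over character n-gram orders 1 through 6 and that whitespace is kept as an ordinary character; real implementations differ in such details, so scores will not match a toolkit exactly. The names `char_ngrams` and `chrf` are illustrative.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return a Counter of all character n-grams in a string."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def chrf(candidate, reference, max_n=6, beta=2.0):
    """Character n-gram F-score, averaging P and R over n-gram orders."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        cand = char_ngrams(candidate, n)
        ref = char_ngrams(reference, n)
        if not cand or not ref:
            continue  # one of the texts is shorter than n characters
        overlap = sum((cand & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(cand.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    chr_p = sum(precisions) / len(precisions)
    chr_r = sum(recalls) / len(recalls)
    if chr_p + chr_r == 0:
        return 0.0
    return (1 + beta ** 2) * chr_p * chr_r / (beta ** 2 * chr_p + chr_r)
```

Because matching happens on characters, a candidate that differs from the reference only by an inflectional suffix still shares most of its n-grams, which is where the robustness for morphologically rich languages comes from.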

BERTScore

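Uses contextual BERT embeddings to compare candidate and reference texts at the token level: each token is matched to its most similar token on the other side by cosine similarity, and the matches are aggregated into precision, recall, and F1.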
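The token-level matching can be sketched independently of any particular model. The example below assumes the per-token embeddings have already been produced by an encoder and are passed in as plain lists of floats; it omits the optional IDF weighting and baseline rescaling of the published metric. The function names and toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def greedy_match_scores(cand_vecs, ref_vecs):
    """Return (precision, recall, f1) from greedy cosine matching."""
    # Precision: each candidate token takes its best match in the reference.
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    # Recall: each reference token takes its best match in the candidate.
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

With toy two-dimensional vectors, a one-token candidate `[[1.0, 0.0]]` against a two-token reference `[[1.0, 0.0], [0.0, 1.0]]` has perfect precision but recall of 0.5, since the second reference token has no similar counterpart.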

MoverScore

Uses contextualized embeddings and Earth Mover’s Distance to measure semantic distance between generated and reference texts.

Perplexity

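Measures how well a model predicts a sample: it is the exponentiated average negative log-likelihood of a sequence. Lower perplexity indicates better prediction.

$$\text{PPL} = \exp\left( -\frac{1}{N} \sum_{i=1}^{N} \log P(w_i \mid w_1, \ldots, w_{i-1}) \right)$$

Where:

  • $P(w_i \mid w_1, \ldots, w_{i-1})$ is the conditional probability of word $w_i$ given previous words.
  • $N$ is the number of words.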
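As a small worked example, the sketch below computes perplexity from the probabilities a model assigned to each observed token. The input list is assumed to come from some language model; the function name `perplexity` is illustrative.

```python
import math


def perplexity(token_probs):
    """Exponentiated average negative log-likelihood of a token sequence.

    `token_probs[i]` is the model's probability for the i-th observed
    token given the tokens before it; each value must be in (0, 1].
    """
    n = len(token_probs)
    avg_nll = -sum(math.log(p) for p in token_probs) / n
    return math.exp(avg_nll)
```

A model that assigns probability 0.25 to every token has perplexity 4, which matches the intuition of choosing uniformly among four options at each step.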

Bits Per Character (BPC)

Similar to perplexity but measured at the character level.

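$$\text{BPC} = -\frac{1}{T} \sum_{t=1}^{T} \log_2 P(c_t \mid c_1, \ldots, c_{t-1})$$

Where:

  • $P(c_t \mid c_1, \ldots, c_{t-1})$ is the conditional probability of character $c_t$ given previous characters.
  • $T$ is the total number of characters.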
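The same calculation at the character level, with base-2 logarithms, can be sketched as follows. The per-character probabilities are assumed to be supplied by a character-level model, and the function name is illustrative.

```python
import math


def bits_per_character(char_probs):
    """Average negative log2-probability per character.

    `char_probs[t]` is the model's probability for the t-th observed
    character given the characters before it; values must be in (0, 1].
    """
    return -sum(math.log2(p) for p in char_probs) / len(char_probs)
```

Raising 2 to the BPC value gives the per-character perplexity, so a BPC of 1.0 corresponds to a character-level perplexity of 2.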

Question Answering Metrics

Exact Match (EM)

Binary measure indicating whether the predicted answer exactly matches the ground truth. Over a dataset, it is reported as the percentage of predictions that exactly match any one of the ground-truth answers.
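In practice "exactly" usually means "after light normalization". The sketch below assumes a SQuAD-style normalization (lowercasing, stripping punctuation and English articles, collapsing whitespace); the exact normalization rules vary by benchmark, and the function names are chosen for this example.

```python
import re
import string


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse spaces."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold_answers):
    """Return 1 if the prediction matches any gold answer, else 0."""
    pred = normalize_answer(prediction)
    return int(any(pred == normalize_answer(gold) for gold in gold_answers))


def exact_match_rate(predictions, gold_answer_lists):
    """Percentage of predictions that match one of their gold answers."""
    scores = [exact_match(p, g) for p, g in zip(predictions, gold_answer_lists)]
    return 100.0 * sum(scores) / len(scores)
```

Under this normalization, "The Eiffel Tower." counts as an exact match for "Eiffel Tower", while "Paris, France" does not match "Paris".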

F1 Score

Word-level F1 between prediction and ground truth, treating both as bags of words.
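A minimal sketch of the bag-of-words computation is shown below. It assumes simple lowercasing and whitespace tokenization; benchmarks typically apply their own answer normalization first and take the maximum F1 over several gold answers. The name `qa_f1` is illustrative.

```python
from collections import Counter


def qa_f1(prediction, gold):
    """Word-level F1 between two answers treated as bags of words."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    # Multiset intersection: each shared word counts at most as often
    # as it occurs in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Unlike Exact Match, this gives partial credit: the prediction "in Paris France" against the gold answer "Paris" has precision 1/3 and recall 1, for an F1 of 0.5.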

Answer Accuracy

For multiple-choice QA, the proportion of questions answered correctly.

Mean Reciprocal Rank (MRR)

For QA models that return a ranked list of answers, the average over all questions of the reciprocal rank of the first correct answer. A question with no correct answer in its list contributes 0.
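The computation can be sketched in a few lines. The example assumes each question comes with a ranked list of candidate answers and a set of acceptable answers, compared by exact string equality; the function name is illustrative.

```python
def mean_reciprocal_rank(ranked_lists, gold_sets):
    """Average of 1 / rank of the first correct answer per question."""
    total = 0.0
    for ranked, gold in zip(ranked_lists, gold_sets):
        for rank, answer in enumerate(ranked, start=1):
            if answer in gold:
                total += 1.0 / rank
                break  # only the first correct answer counts
        # Questions with no correct answer in the list add 0.
    return total / len(ranked_lists)
```

For three questions whose first correct answers sit at rank 1, rank 3, and nowhere in the list, the reciprocal ranks are 1, 1/3, and 0, giving an MRR of about 0.44.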

Sequence Labeling Metrics

Sequence labeling tasks such as Named Entity Recognition (NER) and Part-of-Speech (POS) tagging assign a label to each token in a text; NER additionally groups tokens into typed entity spans.

Entity-Level F1 Score

F1 score calculated at the entity level rather than the token level.

Span-Based F1 Score

F1 score based on the exact match of entity spans.
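To show what entity-level scoring with exact span match looks like, the sketch below extracts entities from BIO tag sequences and counts a prediction as correct only when its type and both boundaries match a gold entity. The BIO scheme, the lenient handling of a stray `I-` tag as the start of a new entity, and the function names are assumptions for this example.

```python
def extract_entities(tags):
    """Convert a BIO tag sequence into a set of (type, start, end) spans.

    `end` is exclusive. An `I-X` tag that does not continue an open
    entity of type X is treated as the start of a new entity.
    """
    entities = set()
    start, ent_type = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing entity
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != ent_type
        )
        if start is not None and (tag == "O" or starts_new):
            entities.add((ent_type, start, i))
            start, ent_type = None, None
        if starts_new:
            start, ent_type = i, tag[2:]
    return entities


def entity_f1(gold_tags, pred_tags):
    """F1 over entities, requiring exact type and span match."""
    gold = extract_entities(gold_tags)
    pred = extract_entities(pred_tags)
    true_pos = len(gold & pred)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)
```

A prediction that tags only the first token of a two-token person name gets most tokens right but no credit for that entity here, which is the practical difference from token-level F1.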

Partial Matching Metrics

  • Partial Precision / Recall — give credit for partial overlap between predicted and true entities.
  • Type-Based Evaluation — separate evaluation for entity type classification and entity boundary detection.

CoNLL Score

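F1 aggregated across all entity types. The CoNLL shared-task evaluation script (conlleval) reports F1 per entity type and an overall F1 that pools counts across types (a micro-average), which can differ from an unweighted mean of the per-type scores.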
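Since reports differ in how per-type scores are combined, the sketch below computes per-type F1 together with both an unweighted (macro) average and a pooled (micro) figure. Entities are assumed to be given as sets of `(type, start, end)` tuples with exact-match scoring; the function names are hypothetical and this is not a reimplementation of any official script.

```python
def f1_from_counts(true_pos, n_pred, n_gold):
    """F1 from raw counts; 0.0 when there are no true positives."""
    if true_pos == 0:
        return 0.0
    precision = true_pos / n_pred
    recall = true_pos / n_gold
    return 2 * precision * recall / (precision + recall)


def per_type_f1(gold_entities, pred_entities):
    """Return (per_type, macro_f1, micro_f1) for exact-match entities."""
    types = sorted({e[0] for e in gold_entities | pred_entities})
    per_type = {}
    for ent_type in types:
        gold_t = {e for e in gold_entities if e[0] == ent_type}
        pred_t = {e for e in pred_entities if e[0] == ent_type}
        per_type[ent_type] = f1_from_counts(len(gold_t & pred_t), len(pred_t), len(gold_t))
    # Macro: unweighted mean of the per-type scores.
    macro = sum(per_type.values()) / len(per_type) if per_type else 0.0
    # Micro: pool true positives and totals over all types first.
    micro = f1_from_counts(
        len(gold_entities & pred_entities), len(pred_entities), len(gold_entities)
    )
    return per_type, macro, micro
```

The two averages diverge when entity types are imbalanced: a rare type with poor F1 pulls the macro average down sharply while barely moving the micro figure.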