NLP metrics

NLP metrics evaluate tasks from machine translation and summarization to question answering and sequence labeling. Most boil down to n-gram overlap, embedding similarity, or classification-style precision/recall tailored to the output type.

When to use which metric

Metric	When to use
BLEU	Translation / generation; n-gram precision.
ROUGE-N / L / S	Summarization; n-gram, LCS, or skip-bigram recall.
METEOR	Translation with synonym and stemming awareness.
chrF	Character-level n-gram F-score; robust for morphologically rich languages.
BERTScore	Semantic similarity via BERT embeddings.
MoverScore	Semantic distance via Earth Mover’s Distance on contextualized embeddings.
Perplexity	Language modeling — how well a model predicts a sequence.
Bits per Character (BPC)	Same idea as perplexity, character-level.
Exact Match (EM)	QA — does the predicted answer match exactly?
QA F1	QA — word-level bag-of-words F1.
Answer Accuracy	Multiple-choice QA.
MRR	Rank of the first correct answer (QA / IR).
Entity-level F1	NER at the entity level.
Span-based F1	NER with exact span match.
CoNLL Score	NER averaged across entity types.

Machine Translation & Text Generation Metrics

Used for translation, summarization, image captioning, and dialogue.

BLEU (Bilingual Evaluation Understudy)

Measures the modified n-gram precision of a candidate translation against one or more reference translations. It correlates only moderately with human judgment and captures fluency and meaning poorly. Because precision alone would favor short outputs, a brevity penalty is applied to candidates shorter than the reference.

BLEU = BP \cdot \exp \left( \sum_{n = 1}^{N} w_{n} \log p_{n} \right)

Where:

$p_{n}$ is the modified n-gram precision.
$w_{n}$ is the weight for n-gram precision (typically uniform weights).
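BP is the brevity penalty, which penalizes candidates shorter than the reference: $BP = 1$ if $c > r$, otherwise $BP = e^{1 - r / c}$, where $c$ is the candidate length and $r$ is the reference length.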
$N$ is the maximum n-gram size (typically 4).
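The formula above can be sketched in a few lines of Python. This is a minimal, unsmoothed sentence-level version that assumes pre-tokenized input and uniform weights; the names `ngrams` and `sentence_bleu` are illustrative and not taken from any library. Production toolkits add smoothing and standardized tokenization, so their scores will generally differ.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, references, max_n=4):
    """Unsmoothed sentence-level BLEU with uniform weights w_n = 1 / max_n."""
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        # Clip each candidate n-gram count by its maximum count in any reference.
        max_ref_counts = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref_counts[gram] = max(max_ref_counts[gram], count)
        clipped = sum(min(c, max_ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if total == 0 or clipped == 0:
            return 0.0  # without smoothing, any zero precision gives BLEU = 0
        log_precision_sum += (1.0 / max_n) * math.log(clipped / total)

    # Brevity penalty: compare candidate length c with the closest reference length r
    # (ties go to the shorter reference).
    c = len(candidate)
    r = min((len(ref) for ref in references), key=lambda length: (abs(length - c), length))
    bp = 1.0 if c > r else math.exp(1.0 - r / c)
    return bp * math.exp(log_precision_sum)
```

With `max_n=1`, the candidate `the the the the` against the reference `the cat is on the mat` gets a clipped precision of 2/4, which is then scaled down further by the brevity penalty.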

METEOR (Metric for Evaluation of Translation with Explicit ORdering)

F-score based on word matches, considering synonyms, stemming, and word order.

METEOR = F_{mean} \cdot (1 - Penalty)

Where:

$F_{mean}$ is a weighted harmonic mean of precision and recall.
Penalty accounts for fragmentation (poor word order).
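As a worked illustration, the sketch below plugs alignment counts into the formula. It assumes the parameterization from the original METEOR formulation ($F_{mean} = 10PR / (R + 9P)$ and $Penalty = 0.5 \cdot (chunks / matches)^{3}$); later METEOR versions tune these constants. The function name and arguments are hypothetical, and the word alignment itself (exact, stem, and synonym matching) is assumed to have been computed elsewhere.

```python
def meteor_from_counts(matches, candidate_len, reference_len, chunks):
    """METEOR score from alignment counts, using the original constants (assumed).

    matches: number of aligned unigrams
    chunks:  number of contiguous, identically ordered runs of aligned unigrams
    """
    if matches == 0:
        return 0.0
    precision = matches / candidate_len
    recall = matches / reference_len
    # Harmonic mean weighted heavily toward recall.
    f_mean = 10 * precision * recall / (recall + 9 * precision)
    # Fewer, longer chunks mean better word order and a smaller penalty.
    penalty = 0.5 * (chunks / matches) ** 3
    return f_mean * (1 - penalty)
```

A perfectly ordered match forms a single chunk, so the penalty is close to zero; the same matches scattered across many chunks are penalized more.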

ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

Measures n-gram recall (and sometimes precision / F1) between generated and reference texts.

ROUGE-N — n-gram overlap.

ROUGE-N = \frac{\sum _{S \in References} \sum _{gram_{n} \in S} Count _{match} ( gram _{n} )}{\sum _{S \in References} \sum _{gram_{n} \in S} Count ( gram _{n} )}

ROUGE-L — longest common subsequence (LCS). Captures sentence-level structure similarity.
ROUGE-S — skip-bigram overlap (pairs of words in sentence order with gaps allowed). ROUGE-SU extends ROUGE-S by adding unigrams.
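A minimal sketch of ROUGE-N and ROUGE-L recall for a single reference, assuming pre-tokenized input. The function names are illustrative; reference implementations also report precision and F1 and may apply stemming or stopword removal.

```python
from collections import Counter


def rouge_n_recall(candidate, reference, n=1):
    """Fraction of reference n-grams that also appear in the candidate (clipped)."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, cand[gram]) for gram, count in ref.items())
    return matched / total


def lcs_length(a, b):
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_recall(candidate, reference):
    """LCS length divided by reference length."""
    return lcs_length(candidate, reference) / len(reference) if reference else 0.0
```

For the candidate `the cat was found under the bed` and the reference `the cat was under the bed`, unigram recall and LCS recall are both 1.0 because every reference word appears in order, while bigram recall drops to 4/5 because the inserted word breaks one reference bigram.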

chrF

Character n-gram F-score.

chrF = (1 + β^{2}) \cdot \frac{chrP \cdot chrR}{β ^{2} \cdot chrP + chrR}

Where:

chrP is character n-gram precision.
chrR is character n-gram recall.
$β$ weights recall relative to precision: recall counts $β$ times as much as precision (typically $β = 2$).
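The sketch below is one simplified reading of the formula: precision and recall are each averaged over character n-gram orders 1 to 6 and then combined. Stripping spaces, the maximum order of 6, and the names `char_ngrams` and `chrf` are assumptions made for illustration; published implementations differ in whitespace handling and in how orders are averaged.

```python
from collections import Counter


def char_ngrams(text, n):
    """Counter of character n-grams, ignoring spaces (a simplifying assumption)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def chrf(candidate, reference, max_n=6, beta=2.0):
    """Character n-gram F-beta score with precision and recall averaged over orders."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        cand, ref = char_ngrams(candidate, n), char_ngrams(reference, n)
        if not cand or not ref:
            continue  # one of the texts is shorter than n characters
        overlap = sum((cand & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(cand.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    chr_p = sum(precisions) / len(precisions)
    chr_r = sum(recalls) / len(recalls)
    if chr_p + chr_r == 0:
        return 0.0
    return (1 + beta ** 2) * chr_p * chr_r / (beta ** 2 * chr_p + chr_r)
```

Because matching happens below the word level, inflected forms that share a stem still earn partial credit, which is why the metric suits morphologically rich languages.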

BERTScore

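Embeds candidate and reference tokens with BERT (or a similar contextual encoder), greedily matches each token to its most similar counterpart by cosine similarity, and aggregates the matches into precision, recall, and F1.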
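The token-level matching can be sketched without a model by treating the embeddings as given. The toy function below assumes each token is already represented by a vector (a list of floats) and applies greedy cosine matching; the name `greedy_match_score` is hypothetical. The actual metric obtains the vectors from a pretrained encoder and optionally adds idf weighting and baseline rescaling, none of which is shown here.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def greedy_match_score(candidate_vecs, reference_vecs):
    """Return (precision, recall, f1) from greedy cosine matching of token vectors."""
    sims = [[cosine(c, r) for r in reference_vecs] for c in candidate_vecs]
    # Precision: each candidate token takes its best match in the reference.
    precision = sum(max(row) for row in sims) / len(candidate_vecs)
    # Recall: each reference token takes its best match in the candidate.
    recall = sum(
        max(sims[i][j] for i in range(len(candidate_vecs)))
        for j in range(len(reference_vecs))
    ) / len(reference_vecs)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

A candidate that covers only one of two orthogonal reference vectors keeps full precision but loses half of its recall.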

MoverScore

Uses contextualized embeddings and Earth Mover’s Distance to measure semantic distance between generated and reference texts.

Perplexity

Measures how well a language model predicts a sample: the exponentiated average negative log-likelihood of a sequence. Lower perplexity indicates better prediction.

Perplexity = 2^{- \frac{1}{N} \sum_{i = 1}^{N} \log_{2} P (w_{i} ∣ w_{1}, ..., w_{i - 1})}

Where:

$P (w_{i} ∣ w_{1}, ..., w_{i - 1})$ is the conditional probability of word $w_{i}$ given previous words.
$N$ is the number of words.
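A small sketch of the computation, assuming the model's conditional probabilities for each word have already been collected into a list (the function name is illustrative):

```python
import math


def perplexity(token_probs):
    """Perplexity from per-token conditional probabilities P(w_i | w_1..w_{i-1})."""
    n = len(token_probs)
    avg_neg_log2 = -sum(math.log2(p) for p in token_probs) / n
    return 2 ** avg_neg_log2
```

A model that assigns probability 0.25 to every word has perplexity 4: it is as uncertain as a uniform choice among four words at each step.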

Bits Per Character (BPC)

Similar to perplexity but measured at the character level.

BPC = - \frac{1}{N} \sum_{i = 1}^{N} \log_{2} P (c_{i} ∣ c_{1}, ..., c_{i - 1})

Where:

$P (c_{i} ∣ c_{1}, ..., c_{i - 1})$ is the conditional probability of character $c_{i}$ given previous characters.
$N$ is the total number of characters.
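The relationship to perplexity is direct: BPC is the exponent in the perplexity formula, taken over characters. A minimal sketch, assuming per-character probabilities are available as a list (the function name is illustrative):

```python
import math


def bits_per_character(char_probs):
    """Average negative log2 probability per character."""
    return -sum(math.log2(p) for p in char_probs) / len(char_probs)


# Character-level perplexity is 2 raised to the BPC.
bpc = bits_per_character([0.5] * 8)
char_perplexity = 2 ** bpc
```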

Question Answering Metrics

Exact Match (EM)

Binary measure indicating whether the predicted answer exactly matches the ground truth. Can also be calculated as a percentage of predictions that match one of the ground truth answers exactly.

EM = \frac{1}{N} \sum_{i = 1}^{N} \mathbb{1} (prediction_{i} = groundtruth_{i})
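In practice "exact" usually means exact after light normalization. The sketch below assumes a normalization commonly used in extractive QA evaluation (lowercasing, stripping punctuation and English articles, collapsing whitespace); the function names are illustrative, and the normalization rules should be adapted to the dataset at hand.

```python
import re
import string


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(predictions, ground_truths):
    """Share of predictions matching any acceptable answer after normalization.

    ground_truths[i] is a list of acceptable answers for predictions[i].
    """
    hits = sum(
        any(normalize_answer(pred) == normalize_answer(gt) for gt in gts)
        for pred, gts in zip(predictions, ground_truths)
    )
    return hits / len(predictions)
```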

F1 Score

Word-level F1 between prediction and ground truth, treating both as bags of words.
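A minimal sketch of the bag-of-words computation, assuming whitespace tokenization and lowercasing only (the function name is illustrative; evaluation scripts typically normalize answers further and take the maximum over several ground truths):

```python
from collections import Counter


def qa_f1(prediction, ground_truth):
    """Token-level F1 between two answer strings treated as bags of words."""
    pred_tokens = prediction.lower().split()
    gt_tokens = ground_truth.lower().split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(gt_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gt_tokens)
    return 2 * precision * recall / (precision + recall)
```

Unlike Exact Match, this gives partial credit: the prediction `in Paris France` against the ground truth `Paris` has precision 1/3 and recall 1, for an F1 of 0.5.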

Answer Accuracy

For multiple-choice QA, the proportion of questions answered correctly.

Mean Reciprocal Rank (MRR)

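For QA models that return a ranked list of answers, the average over all questions of the reciprocal rank of the first correct answer. Here $rank_{i}$ is the position of the first correct answer for question $i$, and $∣ Q ∣$ is the number of questions.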

MRR = \frac{1}{∣ Q ∣} \sum_{i = 1}^{∣ Q ∣} \frac{1}{rank_{i}}
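A short sketch of the computation. It assumes that a question with no correct answer in its ranked list contributes 0, which is a common convention but not stated above; the function name is illustrative.

```python
def mean_reciprocal_rank(ranked_lists, correct_sets):
    """Average of 1 / rank of the first correct answer per question.

    ranked_lists[i] is the ranked answer list for question i;
    correct_sets[i] is the set of answers accepted as correct.
    """
    total = 0.0
    for ranked, correct in zip(ranked_lists, correct_sets):
        for rank, answer in enumerate(ranked, start=1):
            if answer in correct:
                total += 1.0 / rank
                break  # only the first correct answer counts
    return total / len(ranked_lists)
```

With correct answers at ranks 1 and 3 and one question with no correct answer, the result is (1 + 1/3 + 0) / 3.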

Sequence Labeling Metrics

Named Entity Recognition (NER) and Part-of-Speech (POS) tagging involve identifying and classifying named entities or parts of speech in text.

Entity-Level F1 Score

F1 score calculated at the entity level rather than the token level.

Span-Based F1 Score

F1 score based on the exact match of entity spans.
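The difference from token-level scoring is easiest to see in code. The sketch below assumes BIO-tagged sequences and counts a predicted entity as correct only when its type and both boundaries match a gold entity; the function names are illustrative, and ill-formed `I-` tags without a preceding `B-` are simply ignored.

```python
def bio_to_spans(tags):
    """Convert BIO tags to a set of (type, start, end) spans, end exclusive."""
    spans, start, etype = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing entity
        inside = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not inside:
            spans.add((etype, start, i))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def span_f1(gold_tags, pred_tags):
    """Precision, recall, and F1 over exactly matching entity spans."""
    gold, pred = bio_to_spans(gold_tags), bio_to_spans(pred_tags)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1
```

If the gold tags are `B-PER I-PER O B-LOC I-LOC` and the prediction truncates the location to `B-LOC O`, four of five tokens are tagged correctly, yet only one of two entities matches, so span-level F1 is 0.5.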

Partial Matching Metrics

Partial Precision / Recall — give credit for partial overlap between predicted and true entities.
Type-Based Evaluation — separate evaluation for entity type classification and entity boundary detection.
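Several partial-matching schemes exist. The sketch below shows one lenient variant as an assumption, not a standard: a predicted span counts as matched if it overlaps a gold span by at least one token. The `require_type` flag (a hypothetical parameter) switches between scoring type and boundaries together and scoring boundary overlap alone. Spans are `(type, start, end)` tuples with exclusive ends.

```python
def overlaps(a, b):
    """True if two (type, start, end) spans with exclusive ends share a token."""
    return a[1] < b[2] and b[1] < a[2]


def lenient_precision_recall(gold_spans, pred_spans, require_type=True):
    """Precision and recall where any token overlap counts as a match."""
    def matches(p, g):
        return overlaps(p, g) and (not require_type or p[0] == g[0])

    matched_pred = sum(any(matches(p, g) for g in gold_spans) for p in pred_spans)
    matched_gold = sum(any(matches(p, g) for p in pred_spans) for g in gold_spans)
    precision = matched_pred / len(pred_spans) if pred_spans else 0.0
    recall = matched_gold / len(gold_spans) if gold_spans else 0.0
    return precision, recall
```

Comparing the result with `require_type=True` and `require_type=False` separates boundary errors from type errors: a span that scores only in the second setting was located correctly but labeled with the wrong type.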

CoNLL Score

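Overall F1 across all entity types, requiring an exact match of both span and type. The standard CoNLL evaluation script (conlleval) reports this overall score as a micro-average over all entities, alongside per-type precision, recall, and F1.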
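How the per-type scores are combined matters when entity types are imbalanced. The sketch below computes per-type F1 plus two aggregates from sets of `(type, start, end)` spans: a macro average (mean of per-type F1) and a micro average (F1 over pooled counts). The function names are illustrative and exact span matching is assumed.

```python
from collections import defaultdict


def f1(tp, n_pred, n_gold):
    """F1 from true positives, number of predictions, and number of gold items."""
    if tp == 0:
        return 0.0
    p, r = tp / n_pred, tp / n_gold
    return 2 * p * r / (p + r)


def per_type_and_overall_f1(gold_spans, pred_spans):
    """Return (per-type F1 dict, macro-averaged F1, micro-averaged F1)."""
    gold, pred = set(gold_spans), set(pred_spans)
    counts = defaultdict(lambda: [0, 0, 0])  # type -> [tp, n_pred, n_gold]
    for span in pred:
        counts[span[0]][1] += 1
        if span in gold:
            counts[span[0]][0] += 1
    for span in gold:
        counts[span[0]][2] += 1
    per_type = {etype: f1(*c) for etype, c in counts.items()}
    macro = sum(per_type.values()) / len(per_type) if per_type else 0.0
    micro = f1(len(gold & pred), len(pred), len(gold))
    return per_type, macro, micro
```

With three correctly predicted `PER` entities and one missed `LOC` entity, the macro average is 0.5 while the micro average is 0.75, because the micro average weights each entity rather than each type.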

DSWoK — Data Science Well of Knowledge
