Machine Translation & Text Generation Metrics (Summarization, Image Captioning, Dialogue)
- BLEU (Bilingual Evaluation Understudy): Measures the precision of n-gram matches between the candidate translation and reference translations. Correlates moderately with human judgment but doesn't capture fluency or semantic meaning well; being precision-based, it favors shorter candidates, which the brevity penalty is designed to offset.
$$\text{BLEU} = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$$
Where:
- $p_n$ is the modified n-gram precision
- $w_n$ is the weight for n-gram precision (typically uniform, $w_n = 1/N$)
- BP is the brevity penalty to penalize short translations
- $N$ is the maximum n-gram size (typically 4)
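As a sketch of the definition above, sentence-level BLEU with a single reference and no smoothing might be computed as follows (production toolkits such as sacreBLEU add smoothing and careful tokenization; any zero n-gram precision here sends the score straight to 0):

```python
from collections import Counter
import math

def bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU: uniform weights, single reference,
    no smoothing. Tokenization is plain whitespace splitting."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        # Modified precision: clip each candidate n-gram count by its reference count.
        overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        if overlap == 0:
            return 0.0  # unsmoothed: one zero precision zeroes the geometric mean
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: 1 if the candidate is at least as long as the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(log_precisions) / max_n)
```

The geometric mean of the n-gram precisions is computed in log space, matching the $\exp(\sum w_n \log p_n)$ form with $w_n = 1/N$.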
- METEOR (Metric for Evaluation of Translation with Explicit ORdering): Calculates an F-score based on word matches, considering synonyms, stemming, and word order.
$$\text{METEOR} = F_{\text{mean}} \cdot (1 - \text{Penalty})$$
Where:
- $F_{\text{mean}}$ is a weighted harmonic mean of precision and recall (recall weighted more heavily)
- Penalty accounts for fragmentation (poor word order)
- ROUGE (Recall-Oriented Understudy for Gisting Evaluation): Measures n-gram recall (and sometimes precision/F1) between generated and reference texts.
- ROUGE-N: Measures n-gram overlap.
- ROUGE-L: Measures the longest common subsequence (LCS). Captures sentence-level structure similarity.
- ROUGE-S: Measures skip-bigram overlap (pairs of words in sentence order with gaps allowed). ROUGE-SU extends ROUGE-S by adding unigrams to the evaluation.
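A minimal ROUGE-L sketch based on the LCS definition above (single candidate/reference sentence pair; real implementations also handle multi-sentence summaries and report precision and recall separately):

```python
def rouge_l(candidate, reference, beta=1.2):
    """ROUGE-L: F-measure over the longest common subsequence of tokens.
    beta weights recall over precision, as in the original ROUGE paper."""
    c, r = candidate.split(), reference.split()
    # Dynamic-programming LCS length.
    dp = [[0] * (len(r) + 1) for _ in range(len(c) + 1)]
    for i, cw in enumerate(c, 1):
        for j, rw in enumerate(r, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if cw == rw else max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[len(c)][len(r)]
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return (1 + beta ** 2) * prec * rec / (rec + beta ** 2 * prec)
```

Because the LCS keeps words in sentence order without requiring them to be contiguous, this rewards sentence-level structural similarity rather than exact n-gram matches.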
- chrF: Character n-gram F-score.
$$\text{chrF}_\beta = (1 + \beta^2) \cdot \frac{\text{chrP} \cdot \text{chrR}}{\beta^2 \cdot \text{chrP} + \text{chrR}}$$
Where:
- chrP is character n-gram precision
- chrR is character n-gram recall
- $\beta$ determines the recall importance (typically $\beta = 2$)
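A sketch of chrF under the formula above, averaging the F-score over character n-grams of order 1 through 6 with whitespace removed (a common convention; the reference chrF implementation also supports word n-grams and different averaging choices):

```python
from collections import Counter

def chrf(candidate, reference, max_n=6, beta=2.0):
    """chrF: F_beta over character n-grams, averaged over n = 1..max_n.
    Whitespace is stripped before extracting n-grams."""
    cand = candidate.replace(" ", "")
    ref = reference.replace(" ", "")
    f_scores = []
    for n in range(1, max_n + 1):
        cg = Counter(cand[i:i + n] for i in range(len(cand) - n + 1))
        rg = Counter(ref[i:i + n] for i in range(len(ref) - n + 1))
        overlap = sum(min(c, rg[g]) for g, c in cg.items())
        chr_p = overlap / max(sum(cg.values()), 1)  # character n-gram precision
        chr_r = overlap / max(sum(rg.values()), 1)  # character n-gram recall
        if chr_p + chr_r > 0:
            f_scores.append((1 + beta ** 2) * chr_p * chr_r / (beta ** 2 * chr_p + chr_r))
        else:
            f_scores.append(0.0)
    return sum(f_scores) / len(f_scores)
```

Working at the character level makes chrF robust to tokenization differences and morphological variation, which is why it performs well for morphologically rich languages.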
- BERTScore: Uses BERT embeddings to compute similarity between candidate and reference texts at the token level, matching tokens by cosine similarity of their contextual embeddings and aggregating into precision, recall, and F1.
- MoverScore: Uses contextualized embeddings and Earth Mover’s Distance to measure semantic distance between generated and reference texts.
- Perplexity: Measures how well a model predicts a sample. Lower perplexity indicates better prediction. The exponentiated average negative log-likelihood of a sequence.
$$\text{PPL}(W) = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log p(w_i \mid w_1, \ldots, w_{i-1})\right)$$
Where:
- $p(w_i \mid w_1, \ldots, w_{i-1})$ is the conditional probability of word $w_i$ given the previous words
- $N$ is the number of words
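Given the per-token conditional probabilities a language model assigns to a sequence, the formula above reduces to a one-liner (a sketch; in practice the probabilities come from the model's softmax outputs):

```python
import math

def perplexity(token_probs):
    """Perplexity from per-token conditional probabilities p(w_i | w_<i):
    the exponentiated average negative log-likelihood."""
    n = len(token_probs)
    nll = -sum(math.log(p) for p in token_probs) / n
    return math.exp(nll)
```

A model that is uniformly uncertain over k choices at each step assigns each token probability 1/k and gets perplexity k, which is why perplexity is read as an effective branching factor.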
- Bits Per Character (BPC): Similar to perplexity but measured at the character level.
$$\text{BPC} = -\frac{1}{T} \sum_{i=1}^{T} \log_2 p(c_i \mid c_1, \ldots, c_{i-1})$$
Where:
- $p(c_i \mid c_1, \ldots, c_{i-1})$ is the conditional probability of character $c_i$ given the previous characters
- $T$ is the total number of characters
Question Answering Metrics
- Exact Match (EM): Binary measure indicating whether the predicted answer exactly matches the ground truth answer. Can also be calculated as the percentage of predictions that exactly match one of the ground truth answers.
- F1 Score: Word-level F1 score between prediction and ground truth, treating both as bags of words.
- Answer Accuracy: For multiple-choice QA, the proportion of questions answered correctly.
- Mean Reciprocal Rank (MRR): For QA models that return a ranked list of answers, the average of the reciprocal of the rank of the correct answer.
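The EM, word-level F1, and MRR definitions above can be sketched as follows (the function names are illustrative; benchmark scripts such as SQuAD's also lowercase and strip punctuation and articles before comparing):

```python
from collections import Counter

def exact_match(prediction, truths):
    """EM: 1 if the prediction exactly matches any ground-truth answer."""
    return int(any(prediction == t for t in truths))

def qa_f1(prediction, truth):
    """Word-level F1, treating prediction and truth as bags of words."""
    pred, gold = prediction.split(), truth.split()
    common = Counter(pred) & Counter(gold)   # multiset intersection
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

def mrr(ranks):
    """Mean reciprocal rank: each entry is the 1-based rank of the first
    correct answer for a query, or None if no correct answer was returned."""
    return sum(0 if r is None else 1 / r for r in ranks) / len(ranks)
```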
Sequence Labeling Metrics (Named Entity Recognition - NER, Part-of-Speech - POS Tagging)
Named Entity Recognition (NER) involves identifying and classifying named entities in text into predefined categories.
- Entity-Level F1 Score: F1 score calculated at the entity level rather than the token level.
- Span-Based F1 Score: F1 score based on the exact match of entity spans.
- Partial Matching Metrics:
- Partial Precision/Recall: Give credit for partial overlap between predicted and true entities.
- Type-Based Evaluation: Separate evaluation for entity type classification and entity boundary detection.
- CoNLL Score: The average of F1 scores across all entity types.
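A minimal sketch of entity-level (exact-span) F1, assuming entities are represented as hypothetical `(start, end, type)` tuples; an entity counts as correct only if both its span and its type match exactly, which is the strict matching used in CoNLL-style evaluation:

```python
def entity_f1(predicted, gold):
    """Entity-level F1 with strict matching: a predicted entity is a true
    positive only when its (start, end, type) tuple appears in the gold set."""
    pred, true = set(predicted), set(gold)
    tp = len(pred & true)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(true)
    return 2 * precision * recall / (precision + recall)
```

Partial-matching variants relax the `pred & true` intersection to give credit for overlapping spans or for correct type with wrong boundaries, per the bullet above.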