Machine Translation & Text Generation Metrics (Summarization, Image Captioning, Dialogue)

  1. BLEU (Bilingual Evaluation Understudy): Measures the precision of n-gram matches between the candidate translation and reference translations. Correlates moderately with human judgment but does not capture fluency or semantic meaning well. Because n-gram precision alone favors short candidates, BLEU includes a brevity penalty.

BLEU = BP · exp( Σ_{n=1}^{N} w_n · log p_n )

Where:

  • p_n is the modified n-gram precision
  • w_n is the weight for n-gram precision (typically uniform, w_n = 1/N)
  • BP is the brevity penalty to penalize short translations: BP = 1 if the candidate length c exceeds the reference length r, otherwise exp(1 − r/c)
  • N is the maximum n-gram size (typically 4)
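
The formula above can be sketched directly in Python. This is a minimal single-reference, sentence-level version with uniform weights; production implementations (e.g. sacreBLEU) add smoothing and corpus-level aggregation:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU, one reference, uniform weights (a sketch)."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        ref_counts = Counter(ngrams(reference, n))
        # Modified precision: clip each candidate n-gram count by its reference count.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        if clipped == 0:
            return 0.0  # any zero precision drives the geometric mean to zero
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: penalize candidates shorter than the reference.
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(sum(log_precisions) / max_n)
```

A candidate identical to the reference scores 1.0; a candidate too short to contain any 4-gram match scores 0.0 (which is why smoothing is used in practice).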
  2. METEOR (Metric for Evaluation of Translation with Explicit ORdering): Calculates an F-score based on word matches, considering synonyms, stemming, and word order.

METEOR = F_mean · (1 − Penalty)

Where:

  • F_mean is a weighted harmonic mean of precision and recall, with recall weighted more heavily: F_mean = 10 · P · R / (R + 9 · P)
  • Penalty accounts for fragmentation (poor word order): Penalty = 0.5 · (chunks / matched unigrams)³
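
Given alignment statistics, the scoring step is simple arithmetic. This sketch uses the original (2005) parameter values shown above and assumes the unigram alignment (with stemming/synonym matching) has already been computed — that alignment is the hard part and is omitted here:

```python
def meteor_score(matches, cand_len, ref_len, chunks):
    """METEOR from pre-computed alignment statistics (original 2005 weighting).

    matches: number of aligned unigrams; chunks: number of contiguous
    aligned runs in the candidate. The alignment itself is not computed here.
    """
    precision = matches / cand_len
    recall = matches / ref_len
    # Recall-weighted harmonic mean (recall weighted 9:1 over precision).
    f_mean = 10 * precision * recall / (recall + 9 * precision)
    # Fragmentation penalty: more chunks -> worse word order.
    penalty = 0.5 * (chunks / matches) ** 3
    return f_mean * (1 - penalty)
```

For a perfect 5-word match in one contiguous chunk, F_mean = 1 and the penalty is only 0.5 · (1/5)³ = 0.004, giving 0.996.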
  3. ROUGE (Recall-Oriented Understudy for Gisting Evaluation): Measures n-gram recall (and sometimes precision/F1) between generated and reference texts.

    • ROUGE-N: Measures n-gram overlap.

    • ROUGE-L: Measures the longest common subsequence (LCS). Captures sentence-level structure similarity.
    • ROUGE-S: Measures skip-bigram overlap (pairs of words in sentence order with gaps allowed). ROUGE-SU extends ROUGE-S by adding unigrams to the evaluation.
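
ROUGE-L is the easiest variant to sketch, since it reduces to a longest-common-subsequence computation over tokens. A minimal version returning recall, precision, and F1:

```python
def lcs_length(a, b):
    """Longest common subsequence length via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l(candidate, reference):
    """ROUGE-L recall, precision, and F1 over token sequences."""
    lcs = lcs_length(candidate, reference)
    recall = lcs / len(reference)
    precision = lcs / len(candidate)
    f1 = 2 * precision * recall / (precision + recall) if lcs else 0.0
    return recall, precision, f1
```

Because LCS respects word order but allows gaps, it rewards sentence-level structural similarity without requiring contiguous matches.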
  4. chrF: Character n-gram F-score.

chrF_β = (1 + β²) · chrP · chrR / (β² · chrP + chrR)

Where:

  • chrP is character n-gram precision
  • chrR is character n-gram recall
  • β determines the recall importance (typically β = 2, weighting recall twice as heavily as precision)
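
A minimal sketch of the computation, averaging precision and recall across character n-gram orders 1..6 before applying the F_β formula (reference implementations such as sacreBLEU differ in detail, e.g. averaging per-order F-scores):

```python
from collections import Counter

def char_ngram_fscore(candidate, reference, max_n=6, beta=2.0):
    """chrF sketch: average character n-gram precision/recall, then F_beta."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        cand = Counter(candidate[i:i + n] for i in range(len(candidate) - n + 1))
        ref = Counter(reference[i:i + n] for i in range(len(reference) - n + 1))
        if not cand or not ref:
            continue  # string shorter than n
        overlap = sum((cand & ref).values())  # multiset intersection
        precisions.append(overlap / sum(cand.values()))
        recalls.append(overlap / sum(ref.values()))
    chr_p = sum(precisions) / len(precisions)
    chr_r = sum(recalls) / len(recalls)
    if chr_p + chr_r == 0:
        return 0.0
    return (1 + beta ** 2) * chr_p * chr_r / (beta ** 2 * chr_p + chr_r)
```

Operating on characters rather than words makes chrF robust to tokenization differences and morphologically rich languages.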
  5. BERTScore: Uses BERT embeddings to compute similarity between candidate and reference translations at the token level.
  6. MoverScore: Uses contextualized embeddings and Earth Mover’s Distance to measure semantic distance between generated and reference texts.
  7. Perplexity: Measures how well a model predicts a sample. Lower perplexity indicates better prediction. It is the exponentiated average negative log-likelihood of a sequence.

PPL(W) = exp( −(1/N) · Σ_{i=1}^{N} log P(w_i | w_1, …, w_{i−1}) )

Where:

  • P(w_i | w_1, …, w_{i−1}) is the conditional probability of word w_i given the previous words
  • N is the number of words
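
Given per-token log-probabilities from a language model, the computation is one line. This sketch assumes natural-log probabilities, matching the formula above:

```python
import math

def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities log P(w_i | w_<i)."""
    n = len(token_log_probs)
    avg_nll = -sum(token_log_probs) / n  # average negative log-likelihood
    return math.exp(avg_nll)
```

For example, a model that assigns every token probability 1/4 has perplexity 4: it is "as confused" as a uniform choice among four options.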
  8. Bits Per Character (BPC): Similar to perplexity but measured at the character level.

BPC = −(1/T) · Σ_{t=1}^{T} log₂ P(c_t | c_1, …, c_{t−1})

Where:

  • P(c_t | c_1, …, c_{t−1}) is the conditional probability of character c_t given the previous characters
  • T is the total number of characters
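
The same averaging as perplexity, but in base 2 and without exponentiation; a minimal sketch taking per-character probabilities:

```python
import math

def bits_per_character(char_probs):
    """BPC from per-character conditional probabilities P(c_t | c_<t)."""
    return -sum(math.log2(p) for p in char_probs) / len(char_probs)
```

A model that assigns every character probability 1/2 needs exactly 1 bit per character.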

Question Answering Metrics

  1. Exact Match (EM): Binary measure indicating whether the predicted answer exactly matches the ground truth answer. Can also be reported as the percentage of predictions that match one of the ground truth answers exactly.
  2. F1 Score: Word-level F1 score between prediction and ground truth, treating both as bags of words.
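
The bag-of-words F1 can be sketched as below. Official SQuAD-style scripts also lowercase and strip punctuation and articles before comparing; that normalization is omitted here:

```python
from collections import Counter

def token_f1(prediction, ground_truth):
    """SQuAD-style word-level F1, treating both answers as bags of words."""
    pred_tokens = prediction.split()
    gt_tokens = ground_truth.split()
    common = Counter(pred_tokens) & Counter(gt_tokens)  # multiset overlap
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gt_tokens)
    return 2 * precision * recall / (precision + recall)
```

Unlike EM, this gives partial credit: "the cat" against "the black cat" scores 0.8 rather than 0.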

  3. Answer Accuracy: For multiple-choice QA, the proportion of questions answered correctly.

  4. Mean Reciprocal Rank (MRR): For QA models that return a ranked list of answers, the average over questions of the reciprocal rank of the first correct answer.
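
A minimal sketch, assuming one correct answer per question and scoring 0 when it is absent from the ranked list:

```python
def mean_reciprocal_rank(ranked_answer_lists, correct_answers):
    """MRR: average of 1/rank of the first correct answer (0 if absent)."""
    total = 0.0
    for ranked, correct in zip(ranked_answer_lists, correct_answers):
        for rank, answer in enumerate(ranked, start=1):
            if answer == correct:
                total += 1.0 / rank
                break
    return total / len(ranked_answer_lists)
```

If the correct answer appears at rank 2 for one question and rank 3 for another, MRR = (1/2 + 1/3) / 2 ≈ 0.417.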

Sequence Labeling Metrics (Named Entity Recognition - NER, Part-of-Speech - POS Tagging)

Named Entity Recognition (NER) involves identifying and classifying named entities in text into predefined categories.

  1. Entity-Level F1 Score: F1 score calculated at the entity level rather than the token level.

  2. Span-Based F1 Score: F1 score based on the exact match of entity spans.

  3. Partial Matching Metrics:

    • Partial Precision/Recall: Give credit for partial overlap between predicted and true entities.
    • Type-Based Evaluation: Separate evaluation for entity type classification and entity boundary detection.
  4. CoNLL Score: The micro-averaged F1 score over all entities of all types, as computed by the CoNLL shared-task evaluation script (conlleval).
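
The entity-level F1 described above can be sketched as follows, representing each entity as a (start, end, type) tuple so that a prediction only counts when both the span and the type match exactly:

```python
def entity_f1(predicted, gold):
    """Entity-level F1: a prediction counts only if span AND type match exactly.

    Entities are (start, end, type) tuples, e.g. (0, 2, "PER").
    """
    pred_set, gold_set = set(predicted), set(gold)
    tp = len(pred_set & gold_set)  # exact span + type matches
    if tp == 0:
        return 0.0
    precision = tp / len(pred_set)
    recall = tp / len(gold_set)
    return 2 * precision * recall / (precision + recall)
```

Dropping the type from the tuples turns this into a boundary-only (span-based) evaluation; comparing the two isolates whether errors come from boundary detection or type classification.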