### Machine Translation & Text Generation Metrics (Summarization, Image Captioning, Dialogue)
1. **BLEU (Bilingual Evaluation Understudy)**: Measures the precision of n-gram matches between the candidate translation and one or more reference translations. Correlates moderately with human judgment but does not capture fluency or semantic adequacy well; because it is precision-based, it favors short candidates, which the brevity penalty is designed to counteract.
$\text{BLEU} = \text{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$
Where:
- $p_n$ is the modified n-gram precision
- $w_n$ is the weight for n-gram precision (typically uniform weights)
- BP is the brevity penalty to penalize short translations
- $N$ is the maximum n-gram size (typically 4)
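A corpus-level BLEU score can be computed with the HuggingFace `evaluate` library linked below; a minimal sketch follows, where the sentences are made-up illustrations and the exact result keys may vary slightly between library versions.
```python
# pip install evaluate
import evaluate

bleu = evaluate.load("bleu")

predictions = ["the cat sat on the mat"]          # candidate translations
references = [["the cat is on the mat",           # a list of references
               "there is a cat on the mat"]]      # per candidate

# Reports corpus BLEU along with per-n-gram precisions and the brevity penalty
print(bleu.compute(predictions=predictions, references=references))
```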
2. **METEOR (Metric for Evaluation of Translation with Explicit ORdering)**: Calculates an F-score based on unigram matches between candidate and reference, allowing exact, stemmed, and synonym matches, with a penalty for poor word order.
$\text{METEOR} = F_{mean} \cdot (1 - \text{Penalty})$
Where:
- $F_{mean}$ is a weighted harmonic mean of precision and recall
- Penalty accounts for fragmentation (poor word order)
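One way to compute METEOR for a single sentence pair is NLTK's implementation, sketched below; recent NLTK versions expect pre-tokenized input and require the WordNet data to be downloaded first. The sentences are illustrative only.
```python
# pip install nltk; then: python -c "import nltk; nltk.download('wordnet')"
from nltk.translate.meteor_score import meteor_score

reference = "the cat is on the mat".split()    # NLTK >= 3.6 expects token lists
hypothesis = "the cat sat on the mat".split()

# Single-sentence METEOR; multiple references can be passed in the first list
print(meteor_score([reference], hypothesis))
```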
3. **ROUGE (Recall-Oriented Understudy for Gisting Evaluation)**: Measures n-gram recall (and sometimes precision/F1) between generated and reference texts.
- **ROUGE-N**: Measures n-gram overlap.
$\text{ROUGE-N} = \frac{\sum_{S \in \{\text{References}\}} \sum_{\text{gram}_n \in S} \text{Count}_{\text{match}}(\text{gram}_n)}{\sum_{S \in \{\text{References}\}} \sum_{\text{gram}_n \in S} \text{Count}(\text{gram}_n)}$
- **ROUGE-L**: Measures the longest common subsequence (LCS). Captures sentence-level structure similarity.
- **ROUGE-S**: Measures skip-bigram overlap (pairs of words in sentence order with gaps allowed). ROUGE-SU extends ROUGE-S by adding unigrams to the evaluation.
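Google's `rouge-score` package is one common implementation of ROUGE-N and ROUGE-L; the sketch below scores a single made-up sentence pair.
```python
# pip install rouge-score
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

scores = scorer.score(
    target="the quick brown fox jumps over the lazy dog",      # reference
    prediction="a quick brown fox jumped over the lazy dog",   # generated text
)
for name, s in scores.items():
    print(name, f"P={s.precision:.2f} R={s.recall:.2f} F1={s.fmeasure:.2f}")
```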
4. **chrF**: Character n-gram F-score.
$\text{chrF} = (1 + \beta^2) \cdot \frac{\text{chrP} \cdot \text{chrR}}{\beta^2 \cdot \text{chrP} + \text{chrR}}$
Where:
- chrP is character n-gram precision
- chrR is character n-gram recall
- β controls how much recall is weighted relative to precision (with the typical β = 2, recall counts twice as much as precision)
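The sacrebleu library ships a chrF implementation (β = 2 by default); a minimal sketch with an invented hypothesis/reference pair:
```python
# pip install sacrebleu
import sacrebleu

hypotheses = ["the cat sat on the mat"]
references = [["the cat is on the mat"]]   # one reference stream; add more lists for multiple refs

chrf = sacrebleu.corpus_chrf(hypotheses, references)  # character n-grams, beta=2 by default
print(chrf.score)                                     # reported on a 0-100 scale
```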
5. **BERTScore**: Uses contextual BERT embeddings to compute cosine similarity between candidate and reference tokens, greedily matching them and aggregating the similarities into precision, recall, and F1.
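A sketch using the `bert-score` package, which exposes a single `score` function returning per-sentence precision, recall, and F1 tensors; the sentences are made up, and `lang="en"` selects a default English model.
```python
# pip install bert-score
from bert_score import score

candidates = ["the weather is cold today"]
references = ["it is freezing today"]

# Downloads a pretrained model on first use
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"BERTScore F1 = {F1.mean().item():.3f}")
```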
6. **MoverScore**: Uses contextualized embeddings and Earth Mover's Distance to measure semantic distance between generated and reference texts.
7. **Perplexity**: Measures how well a model predicts a sample; lower perplexity indicates better prediction. It is the exponentiated average negative log-likelihood of a sequence (a worked example follows the BPC entry below).
$\text{Perplexity} = 2^{-\frac{1}{N} \sum_{i=1}^{N} \log_2 P(w_i | w_1, ..., w_{i-1})}$
Where:
- $P(w_i | w_1, ..., w_{i-1})$ is the conditional probability of word $w_i$ given previous words
- $N$ is the number of words
8. **Bits Per Character (BPC)**: The average negative log2-likelihood per character, i.e., the character-level cross-entropy in bits; character-level perplexity equals $2^{\text{BPC}}$.
$\text{BPC} = -\frac{1}{N} \sum_{i=1}^{N} \log_2 P(c_i | c_1, ..., c_{i-1})$
Where:
- $P(c_i | c_1, ..., c_{i-1})$ is the conditional probability of character $c_i$ given previous characters
- $N$ is the total number of characters
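To make the two formulas above concrete, here is a minimal sketch that plugs hypothetical per-token and per-character probabilities (invented numbers, not actual model outputs) into the perplexity and BPC definitions.
```python
import math

# Hypothetical P(w_i | w_1, ..., w_{i-1}) values assigned by some language model
token_probs = [0.20, 0.05, 0.40, 0.10]

# Perplexity = 2 ** (average negative log2-likelihood per token)
avg_neg_log2 = -sum(math.log2(p) for p in token_probs) / len(token_probs)
perplexity = 2 ** avg_neg_log2
print(f"perplexity = {perplexity:.2f}")

# BPC is the same average negative log2-likelihood, taken over characters instead
char_probs = [0.50, 0.25, 0.125, 0.25]   # hypothetical P(c_i | c_1, ..., c_{i-1})
bpc = -sum(math.log2(p) for p in char_probs) / len(char_probs)
print(f"bits per character = {bpc:.2f}")  # character-level perplexity would be 2 ** bpc
```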
### Question Answering Metrics
1. **Exact Match (EM)**: Binary measure indicating whether the predicted answer exactly matches the ground truth answer, usually after light normalization (lowercasing, removing punctuation and articles). Typically reported as the percentage of predictions that match at least one of the ground truth answers exactly (see the sketch after this list).
$\text{EM} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}(\text{prediction}_i = \text{groundtruth}_i)$
2. **F1 Score**: Word-level F1 score between prediction and ground truth, treating both as bags of words.
3. **Answer Accuracy**: For multiple-choice QA, the proportion of questions answered correctly.
4. **Mean Reciprocal Rank (MRR)**: For QA models that return a ranked list of answers, the average over queries of the reciprocal rank of the first correct answer.
$\text{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\text{rank}_i}$
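The sketch below implements SQuAD-style EM and token-level F1 with light answer normalization, plus MRR over hypothetical ranks; the normalization rules and example answers are assumptions for illustration, not the official SQuAD evaluation script.
```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace (SQuAD-style)."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, ground_truths: list) -> int:
    """1 if the normalized prediction equals any normalized ground-truth answer."""
    return int(any(normalize(prediction) == normalize(gt) for gt in ground_truths))

def token_f1(prediction: str, ground_truth: str) -> float:
    """Bag-of-words F1 between normalized prediction and ground truth."""
    pred, gold = normalize(prediction).split(), normalize(ground_truth).split()
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

def mrr(first_correct_ranks: list) -> float:
    """Mean reciprocal rank; ranks are 1-based positions of the first correct answer."""
    return sum(1.0 / r for r in first_correct_ranks) / len(first_correct_ranks)

print(exact_match("The Eiffel Tower", ["Eiffel Tower"]))   # 1 after normalization
print(round(token_f1("in Paris, France", "Paris"), 2))     # 0.5
print(round(mrr([1, 2, 4]), 3))                            # (1 + 1/2 + 1/4) / 3
```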
### Sequence Labeling Metrics (Named Entity Recognition - NER, Part-of-Speech - POS Tagging)
Named Entity Recognition (NER) involves identifying and classifying named entities in text into predefined categories.
1. **Entity-Level F1 Score**: F1 score calculated over predicted entities rather than individual tokens (see the sketch after this list).
2. **Span-Based F1 Score**: F1 score based on the exact match of entity spans.
3. **Partial Matching Metrics**:
- **Partial Precision/Recall**: Give credit for partial overlap between predicted and true entities.
- **Type-Based Evaluation**: Separate evaluation for entity type classification and entity boundary detection.
4. **CoNLL Score**: The entity-level F1 score micro-averaged over all entities (the figure reported by the official CoNLL `conlleval` script), as opposed to a macro average of per-type F1 scores.
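For entity-level evaluation, the `seqeval` package is a common choice: it scores exact entity spans extracted from BIO-tagged sequences, and its default F1 is the micro average over all entities, matching the CoNLL-style score described above. The tag sequences below are invented.
```python
# pip install seqeval
from seqeval.metrics import classification_report, f1_score

# BIO-tagged gold and predicted sequences, one list of tags per sentence
y_true = [["B-PER", "I-PER", "O", "B-LOC"], ["O", "B-ORG", "I-ORG", "O"]]
y_pred = [["B-PER", "I-PER", "O", "B-LOC"], ["O", "B-ORG", "O", "O"]]  # truncated ORG span counts as an error

# Entity-level, micro-averaged F1 over all entity types
print(f1_score(y_true, y_pred))

# Per-type precision/recall/F1 plus the overall averages
print(classification_report(y_true, y_pred))
```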
## Links
- [ROUGE: A Package for Automatic Evaluation of Summaries](https://aclanthology.org/W04-1013/)
- [BLEU: a Method for Automatic Evaluation of Machine Translation](https://aclanthology.org/P02-1040/)
- [HuggingFace Evaluate Library](https://huggingface.co/docs/evaluate/index)