Loss functions commonly used in NLP tasks. For general losses (Cross-Entropy, MSE, KL Divergence), see General losses. For NLP evaluation metrics, see NLP metrics.
Negative Log-Likelihood (NLL) Loss
NLL loss is commonly used in language modeling and sequence prediction.
$$\mathcal{L}_{\text{NLL}} = -\frac{1}{N}\sum_{i=1}^{N} \log p(y_i)$$
Where:
- $N$ is the number of examples (or tokens)
- $p(y_i)$ is the predicted probability of the true token/class for the $i$-th example
Applications:
- Language modeling
- Machine translation
- Text generation
- Sequence prediction
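As a minimal sketch in plain Python (function names are illustrative, not from any library), the per-example and sequence-averaged forms look like:

```python
import math

def nll_loss(probs, target_idx):
    """Negative log-likelihood of the true class under a predicted distribution."""
    return -math.log(probs[target_idx])

def sequence_nll(step_probs, targets):
    """Average NLL over a sequence, given one probability distribution per step."""
    return sum(nll_loss(p, t) for p, t in zip(step_probs, targets)) / len(targets)
```

A confident correct prediction (probability near 1) yields a loss near 0, while a low probability on the true token is penalized sharply.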
Perplexity
Perplexity is the exponential of the average negative log-likelihood, making it interpretable as the effective number of equally likely choices the model is deciding among at each step. Perplexity is an evaluation metric rather than a training loss, but because the exponential is monotonic, minimizing NLL is equivalent to minimizing perplexity.
$$\text{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log p(w_i \mid w_1, \ldots, w_{i-1})\right)$$
Where:
- $N$ is the number of tokens in the sequence
- $p(w_i \mid w_1, \ldots, w_{i-1})$ is the probability of the $i$-th token given the previous tokens
Applications:
- Language modeling
- Text generation evaluation
- Speech recognition
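A small illustrative helper (pure Python, hypothetical function name) makes the "effective number of choices" reading concrete:

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-likelihood over a token sequence."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)
```

If the model assigns each token probability 1/4, perplexity is exactly 4, as if it were choosing uniformly among four options at every step.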
Connectionist Temporal Classification (CTC) Loss
CTC loss aligns sequence-to-sequence data without requiring pre-segmented training data or explicit alignments.
$$\mathcal{L}_{\text{CTC}} = -\log \sum_{\pi \in \mathcal{B}^{-1}(y)} \prod_{t=1}^{T} p(\pi_t \mid x)$$
Where:
- $\mathcal{B}^{-1}(y)$ is the set of all alignments (frame-level paths, including blanks and repeats) that collapse to the target sequence $y$
- $p(\pi_t \mid x)$ is the probability of emitting symbol $\pi_t$ at time $t$ given input $x$
Applications:
- Speech recognition
- Handwriting recognition
- Protein sequence alignment
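In practice CTC is computed with an efficient forward-backward recursion, but the definition can be checked by brute force on tiny inputs. The sketch below (illustrative only; exponential in the number of time steps) enumerates every frame-level path, collapses it CTC-style, and sums the probabilities of paths matching the target:

```python
import itertools
import math

def collapse(path, blank=0):
    """CTC collapse: merge consecutive repeats, then drop blanks."""
    out, prev = [], None
    for s in path:
        if s != prev and s != blank:
            out.append(s)
        prev = s
    return tuple(out)

def ctc_loss_bruteforce(log_probs, target, blank=0):
    """Negative log of the total probability of all alignments of `target`.
    log_probs: T x V list of per-frame log-probabilities."""
    T, V = len(log_probs), len(log_probs[0])
    total = 0.0
    for path in itertools.product(range(V), repeat=T):
        if collapse(path, blank) == tuple(target):
            total += math.exp(sum(log_probs[t][s] for t, s in enumerate(path)))
    return -math.log(total)
```

With two frames, a two-symbol vocabulary (blank plus one label), and uniform per-frame probabilities, three of the four paths collapse to the single-label target, so the loss is $-\log 0.75$.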
Triplet Loss
Triplet loss learns embeddings where similar items are closer together and dissimilar items are farther apart.
$$\mathcal{L} = \max\big(0,\; d(a, p) - d(a, n) + \text{margin}\big)$$
Where:
- $a$ is the anchor sample
- $p$ is a positive sample similar to the anchor
- $n$ is a negative sample dissimilar to the anchor
- $d$ is a distance function (typically Euclidean or cosine distance)
- $\text{margin}$ is a hyperparameter enforcing a minimum gap between the positive and negative distances
Applications:
- Sentence embeddings
- Document similarity
- Face recognition
- Image retrieval
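A minimal sketch with Euclidean distance (illustrative helper names, not a library API):

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """max(0, d(a, p) - d(a, n) + margin)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)
```

The loss is zero once the negative is at least `margin` farther from the anchor than the positive, so well-separated triplets contribute no gradient.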
Contrastive Loss
Contrastive loss is used to learn discriminative features by pushing similar samples closer and dissimilar samples further apart.
$$\mathcal{L} = \frac{1}{2}\left[y \, d^2 + (1 - y)\max(0,\; m - d)^2\right]$$
Where:
- $y$ is 1 for similar pairs and 0 for dissimilar pairs
- $d$ is the distance between the two samples' embeddings
- $m$ is a margin hyperparameter: dissimilar pairs are only penalized when closer than $m$
Applications:
- Sentence similarity
- Learning text embeddings
- Siamese networks for document comparison
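A compact sketch, assuming the common formulation with a 1/2 factor and the convention that $y = 1$ marks a similar pair:

```python
def contrastive_loss(d, y, margin=1.0):
    """Contrastive loss for one pair with embedding distance d.
    y = 1 for similar pairs (pull together), y = 0 for dissimilar pairs
    (push apart until the distance exceeds the margin)."""
    return 0.5 * (y * d ** 2 + (1 - y) * max(0.0, margin - d) ** 2)
```

Similar pairs are penalized quadratically in their distance, while dissimilar pairs already farther apart than the margin incur zero loss.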
Reinforcement Learning from Human Feedback (RLHF) Losses
PPO (Proximal Policy Optimization) Loss
PPO constrains each policy update by clipping the probability ratio between the new and old policies, discouraging destructively large steps. Written as a loss to minimize:
$$\mathcal{L}_{\text{PPO}} = -\mathbb{E}_t\left[\min\big(r_t(\theta)\hat{A}_t,\; \text{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\right]$$
Where:
- $r_t(\theta) = \dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$ is the ratio of the new policy probability to the old policy probability
- $\hat{A}_t$ is the advantage estimate at time step $t$
- $\epsilon$ is a hyperparameter that constrains policy updates
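For a single sample, the clipped surrogate can be sketched as follows (scalar inputs for clarity; real implementations operate on batches of log-probabilities):

```python
def ppo_clip_loss(ratio, advantage, eps=0.2):
    """Clipped PPO surrogate for one sample, negated so lower is better.
    ratio = pi_new(a|s) / pi_old(a|s)."""
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps) * advantage
    # Taking the min makes the objective pessimistic: the policy gains
    # nothing from moving the ratio outside [1 - eps, 1 + eps].
    return -min(unclipped, clipped)
```

With a positive advantage, pushing the ratio above `1 + eps` yields no further improvement; with a negative advantage, the clip likewise caps how much the update can exploit a single estimate.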
Direct Preference Optimization (DPO) Loss
DPO fine-tunes the policy directly on preference pairs, sidestepping an explicit reward model:
$$\mathcal{L}_{\text{DPO}} = -\mathbb{E}_{(x,\, y_w,\, y_l)}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$$
Where:
- $\pi_\theta$ is the policy being trained
- $\pi_{\text{ref}}$ is the reference policy
- $x$, $y_w$, and $y_l$ are the input, preferred output, and dispreferred output
- $\beta$ is a hyperparameter controlling how strongly the policy is kept close to the reference
- $\sigma$ is the logistic sigmoid function
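The per-pair loss is simple to sketch given sequence log-probabilities under the trained and reference policies (illustrative function, scalar log-probs):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair.
    logp_* are log pi_theta(y|x); ref_logp_* are log pi_ref(y|x)."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)): small when the policy favors the preferred
    # output more than the reference does, relative to the dispreferred one.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy matches the reference, the margin is zero and the loss is $\log 2$; the loss falls as the policy shifts probability toward the preferred output relative to the reference.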
Applications:
- Fine-tuning language models based on human preferences
- Aligning large language models with human values
- Improving language model outputs for specific criteria