Loss functions commonly used in NLP tasks. For general losses (Cross-Entropy, MSE, KL Divergence), see General losses. For NLP evaluation metrics, see NLP metrics.
When to use which loss
| Loss | When to use |
|---|---|
| Negative Log-Likelihood (NLL) | Language modeling, sequence prediction. |
| Perplexity | Evaluation metric derived from NLL: the exponentiated average NLL, i.e. the geometric mean of inverse token probabilities. |
| CTC | Sequence-to-sequence without explicit alignment (speech, OCR). |
| Triplet | Metric learning; similar items close, dissimilar far. |
| Contrastive | Siamese networks; discriminative feature learning. |
| PPO | Policy-gradient RL with clipped updates for stability. |
| DPO | Preference-based LLM fine-tuning without a reward model. |
Negative Log-Likelihood (NLL) Loss
Commonly used in language modeling and sequence prediction.
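$$
\mathcal{L}_{\text{NLL}} = -\frac{1}{N}\sum_{i=1}^{N} \log p_i
$$
Where $p_i$ is the predicted probability of the true token/class for the $i$-th example and $N$ is the number of examples (tokens) averaged over.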
Applications: Language modeling, machine translation, text generation, sequence prediction.
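As a minimal sketch of the computation, assuming the loss is averaged over tokens and that the probabilities assigned to the true tokens are already available as a plain list (the function name `nll_loss` is illustrative, not from any library):

```python
import math

def nll_loss(true_token_probs):
    """Mean negative log-likelihood of the probabilities the model
    assigned to the correct tokens (assumed to be in (0, 1])."""
    return -sum(math.log(p) for p in true_token_probs) / len(true_token_probs)

# Hypothetical probabilities assigned to three true tokens.
probs = [0.5, 0.25, 0.125]
loss = nll_loss(probs)  # approximately log(4), about 1.386
```

In a deep learning framework the same quantity is usually computed from log-probabilities directly, which avoids taking the log of very small numbers.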
Perplexity
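Perplexity is the exponential of the average negative log-likelihood. It can be read as the effective number of equally likely tokens the model is choosing among at each step. Perplexity is an evaluation metric rather than a training loss, but because the exponential is monotonic, minimizing NLL also minimizes perplexity.
$$
\text{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log p(x_i \mid x_{<i})\right)
$$
Where $p(x_i \mid x_{<i})$ is the probability of the $i$-th token given previous tokens and $N$ is the number of tokens.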
Applications: Language modeling, text generation evaluation, speech recognition.
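The relationship to NLL can be sketched in a few lines, assuming per-token probabilities of the true tokens are given as a list (the name `perplexity` is illustrative):

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-likelihood over the tokens."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# A model that is uniformly unsure among 4 tokens at every step
# has perplexity of approximately 4.
uniform_ppl = perplexity([0.25] * 8)

# A more confident model gets a lower perplexity.
confident_ppl = perplexity([0.9, 0.8, 0.95])
```

The uniform case illustrates the "effective number of choices" reading: assigning probability 1/k to every true token gives a perplexity of k.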
Connectionist Temporal Classification (CTC) Loss
Aligns sequence-to-sequence data without requiring pre-segmented training data or explicit alignments.
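$$
\mathcal{L}_{\text{CTC}} = -\log \sum_{\pi \in \mathcal{A}(y)} \prod_{t=1}^{T} p(\pi_t \mid x)
$$
Where:
- $\mathcal{A}(y)$ is the set of all possible alignments that correspond to the target sequence $y$.
- $p(\pi_t \mid x)$ is the probability of alignment label $\pi_t$ at time $t$ given input $x$.
- $T$ is the number of input time steps.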
Applications: Speech recognition, handwriting recognition, protein sequence alignment.
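The sum over alignments is not enumerated in practice; it is usually computed with a forward (dynamic programming) pass over the target interleaved with blanks. The following is a small, unoptimized sketch of that idea in plain Python, assuming per-time-step probabilities are given as rows that each sum to 1 and that label `0` is the blank (both assumptions for illustration). Real implementations work in log space for numerical stability.

```python
import math

def ctc_loss(probs, target, blank=0):
    """probs[t][k]: probability of label k at time step t.
    target: list of label ids without blanks.
    Returns -log p(target | input) via the CTC forward recursion."""
    # Extended target with blanks around every label: [b, y1, b, y2, ..., b]
    ext = [blank]
    for label in target:
        ext += [label, blank]
    S = len(ext)

    # alpha[s]: total probability of all partial alignments that end
    # at position s of the extended target at the current time step.
    alpha = [0.0] * S
    alpha[0] = probs[0][ext[0]]
    if S > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, len(probs)):
        new = [0.0] * S
        for s in range(S):
            total = alpha[s]                      # stay on the same symbol
            if s >= 1:
                total += alpha[s - 1]             # advance by one position
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * probs[t][ext[s]]
        alpha = new

    # Valid alignments end on the last label or on the trailing blank.
    likelihood = alpha[-1] + (alpha[-2] if S > 1 else 0.0)
    return -math.log(likelihood) if likelihood > 0 else math.inf

# Two time steps, two classes (0 = blank, 1 = a label), target [1].
# Valid alignments: (1,1), (1,blank), (blank,1) with total probability
# 0.6*0.7 + 0.6*0.3 + 0.4*0.7 = 0.88.
example_probs = [[0.4, 0.6], [0.3, 0.7]]
loss = ctc_loss(example_probs, [1])
```

If the input has too few time steps to emit the target, no valid alignment exists and this sketch returns infinity.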
Triplet Loss
Learns embeddings where similar items are closer together and dissimilar items are farther apart.
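$$
\mathcal{L}_{\text{triplet}} = \max\left(0,\ d(a, p) - d(a, n) + \text{margin}\right)
$$
Where:
- $a$ is the anchor sample.
- $p$ is a positive sample similar to the anchor.
- $n$ is a negative sample dissimilar to the anchor.
- $d$ is a distance function (typically Euclidean or cosine).
- margin is a hyperparameter that sets how much farther the negative must be than the positive before the loss reaches zero.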
Applications: Sentence embeddings, document similarity, face recognition, image retrieval.
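A minimal sketch of the hinge form max(0, d(a, p) - d(a, n) + margin), assuming Euclidean distance and embeddings given as plain lists of floats (the helper names and the default margin of 1.0 are illustrative choices):

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0, dist=euclidean):
    """Zero once the negative is at least `margin` farther from the
    anchor than the positive is; positive otherwise."""
    return max(0.0, dist(anchor, positive) - dist(anchor, negative) + margin)

anchor, positive = [0.0, 0.0], [0.0, 1.0]

# Easy triplet: negative is far away (distance 5), so the loss is 0.
easy = triplet_loss(anchor, positive, [3.0, 4.0])

# Hard triplet: negative is as close as the positive (distance 1),
# so the loss equals the margin.
hard = triplet_loss(anchor, positive, [1.0, 0.0])
```

Only triplets that violate the margin contribute gradient, which is why training pipelines often mine hard or semi-hard negatives.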
Contrastive Loss
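Used to learn discriminative features by pulling similar samples together and pushing dissimilar samples at least a margin apart.
$$
\mathcal{L}_{\text{contrastive}} = y\, d^2 + (1 - y)\, \max(0,\ m - d)^2
$$
Where:
- $y$ is 0 for dissimilar pairs and 1 for similar pairs.
- $d$ is the distance between samples.
- $m$ is the margin hyperparameter; dissimilar pairs farther apart than $m$ contribute no loss.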
Applications: Sentence similarity, learning text embeddings, Siamese networks for document comparison.
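As a sketch, assuming the common pairwise form y * d^2 + (1 - y) * max(0, m - d)^2 with y = 1 for similar pairs (some references scale both terms by 1/2 or flip the label convention; the default margin here is an illustrative choice):

```python
def contrastive_loss(distance, y, margin=1.0):
    """Pairwise contrastive loss for one pair.
    y = 1: similar pair, penalize any distance.
    y = 0: dissimilar pair, penalize only distances below the margin."""
    similar_term = y * distance ** 2
    dissimilar_term = (1 - y) * max(0.0, margin - distance) ** 2
    return similar_term + dissimilar_term

similar_pair = contrastive_loss(0.5, y=1)      # 0.25: pulled closer
close_dissimilar = contrastive_loss(0.5, y=0)  # 0.25: pushed apart
far_dissimilar = contrastive_loss(2.0, y=0)    # 0.0: already beyond margin
```

Unlike triplet loss, each pair is scored independently, so the absolute scale of distances matters rather than only their relative order.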
Reinforcement Learning from Human Feedback (RLHF) Losses
PPO (Proximal Policy Optimization) Loss
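$$
L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\, A_t,\ \text{clip}\left(r_t(\theta),\ 1 - \epsilon,\ 1 + \epsilon\right) A_t\right)\right]
$$
This clipped surrogate is an objective to maximize; the loss minimized in practice is its negative.
Where:
- $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t)$ is the ratio of new policy probability to old policy probability.
- $A_t$ is the advantage estimate.
- $\epsilon$ is a hyperparameter that constrains policy updates.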
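The clipping behavior is easiest to see numerically. Below is a sketch of the clipped surrogate, negated so it can be minimized, assuming per-action log-probabilities under the new and old policies and precomputed advantages are given as lists (the function name and the default `epsilon=0.2` are illustrative assumptions; value and entropy terms used in full PPO implementations are omitted):

```python
import math

def ppo_clip_loss(new_logps, old_logps, advantages, epsilon=0.2):
    """Negative clipped surrogate objective, averaged over samples."""
    total = 0.0
    for new_lp, old_lp, adv in zip(new_logps, old_logps, advantages):
        ratio = math.exp(new_lp - old_lp)                   # pi_new / pi_old
        clipped = min(max(ratio, 1.0 - epsilon), 1.0 + epsilon)
        # Pessimistic choice: take the smaller of the two objectives.
        total += min(ratio * adv, clipped * adv)
    return -total / len(advantages)

old_lp = math.log(0.2)

# Ratio 1.5 with positive advantage: gain is capped at (1 + epsilon) * A.
capped = ppo_clip_loss([math.log(0.3)], [old_lp], [1.0])     # about -1.2

# Ratio 1.5 with negative advantage: no clipping, full penalty applies.
uncapped = ppo_clip_loss([math.log(0.3)], [old_lp], [-1.0])  # about 1.5
```

The asymmetry is intentional: the objective stops rewarding moves beyond the trust region but never hides a move that makes things worse.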
Direct Preference Optimization (DPO) Loss
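$$
\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l)}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]
$$
Where:
- $\pi_\theta$ is the policy being trained.
- $\pi_{\text{ref}}$ is the reference policy.
- $(x, y_w, y_l)$ are input, preferred output, and dispreferred output.
- $\sigma$ is the logistic sigmoid function.
- $\beta$ is a hyperparameter that controls how strongly the policy is kept close to the reference policy.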
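For a single preference pair, the DPO loss reduces to a logistic loss on the difference of log-probability ratios. A sketch, assuming sequence-level log-probabilities of the preferred and dispreferred outputs under the policy and the reference are already computed (argument names and the default `beta=0.1` are illustrative assumptions):

```python
import math

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log(sigmoid(beta * (log-ratio of preferred - log-ratio of dispreferred)))."""
    margin = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    # -log(sigmoid(m)) = log(1 + exp(-m)), written to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Policy identical to the reference: loss is log(2), about 0.693.
neutral = dpo_loss(-2.0, -2.0, -2.0, -2.0)

# Policy raised the preferred output and lowered the dispreferred one.
better = dpo_loss(-1.0, -3.0, -2.0, -2.0)

# Policy moved in the wrong direction.
worse = dpo_loss(-3.0, -1.0, -2.0, -2.0)
```

Because only log-probability ratios against the reference enter the loss, no separate reward model is needed; the reference policy's log-probabilities can be precomputed once per dataset.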
Applications: Fine-tuning language models based on human preferences, aligning large language models with human values, improving language model outputs for specific criteria.