Loss functions commonly used in NLP tasks. For general losses (Cross-Entropy, MSE, KL Divergence), see General losses. For NLP evaluation metrics, see NLP metrics.

Negative Log-Likelihood (NLL) Loss

NLL loss is commonly used in language modeling and sequence prediction.

For a single prediction, the loss is:

  L_NLL = -log p(y)

Where:

  • p(y) is the predicted probability of the true token/class

Applications:

  • Language modeling
  • Machine translation
  • Text generation
  • Sequence prediction
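As a concrete sketch of the formula above (plain Python, no ML framework; `nll_loss` is an illustrative helper, not a library function):

```python
import math

def nll_loss(probs, target):
    """Negative log-likelihood for a single prediction.

    probs: predicted probability distribution over classes/tokens
    target: index of the true class/token
    """
    return -math.log(probs[target])

# A model that puts 0.7 on the correct token incurs a small loss;
# a model that puts only 0.1 on it is penalized much more heavily.
confident = nll_loss([0.1, 0.2, 0.7], target=2)   # -log(0.7), about 0.36
uncertain = nll_loss([0.7, 0.2, 0.1], target=2)   # -log(0.1), about 2.30
```

In practice frameworks compute this from logits (e.g. a log-softmax followed by NLL) for numerical stability, but the quantity being minimized is the same.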

Perplexity

Perplexity is the exponential of the average negative log-likelihood, making it interpretable as a weighted average branching factor: the effective number of equally likely choices the model faces at each step. Perplexity is an evaluation metric rather than a training loss, but minimizing NLL is equivalent to minimizing perplexity.

  PPL = exp( -(1/N) * Σ_{i=1}^{N} log p(w_i | w_1, ..., w_{i-1}) )

Where:

  • N is the number of tokens in the sequence
  • p(w_i | w_1, ..., w_{i-1}) is the probability of the i-th token given previous tokens

Applications:

  • Language modeling
  • Text generation evaluation
  • Speech recognition
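A minimal sketch of the computation, assuming the per-token conditional probabilities have already been extracted from a model (`perplexity` is an illustrative helper):

```python
import math

def perplexity(token_probs):
    """Perplexity from per-token conditional probabilities p(w_i | w_<i)."""
    n = len(token_probs)
    avg_nll = -sum(math.log(p) for p in token_probs) / n
    return math.exp(avg_nll)

# A model that assigns 0.25 to every token is, on average,
# "choosing among" 4 equally likely options: perplexity is about 4.
uniform_ppl = perplexity([0.25, 0.25, 0.25, 0.25])
# A model that is certain of every token has the minimum perplexity, 1.
certain_ppl = perplexity([1.0, 1.0, 1.0])
```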

Connectionist Temporal Classification (CTC) Loss

CTC loss aligns sequence-to-sequence data without requiring pre-segmented training data or explicit alignments.

  L_CTC = -log p(y | x) = -log Σ_{π ∈ B⁻¹(y)} p(π | x)

Where:

  • B⁻¹(y) is the set of all possible alignments (paths) that collapse to the target sequence y
  • p(π | x) = Π_t p(π_t | x) is the probability of alignment π given input x, with π_t the alignment label at time t

Applications:

  • Speech recognition
  • Handwriting recognition
  • Protein sequence alignment
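The sum over alignments is computed efficiently with the forward (alpha) recursion over the target interleaved with blanks. A simplified sketch in plain Python (real implementations vectorize this and also run a backward pass for gradients):

```python
import math

NEG_INF = float("-inf")

def logaddexp(a, b):
    """log(exp(a) + exp(b)), stable in log space."""
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def ctc_loss(log_probs, target, blank=0):
    """CTC loss via the forward recursion.

    log_probs: T x V per-timestep log-probabilities over the label vocabulary
    target: target label sequence (without blanks)
    blank: index of the blank label
    """
    # Extended target with blanks interleaved: b, y1, b, y2, ..., b
    ext = [blank]
    for y in target:
        ext += [y, blank]
    S, T = len(ext), len(log_probs)

    # alpha[s]: log-prob of all partial alignments ending at ext[s] at time t
    alpha = [NEG_INF] * S
    alpha[0] = log_probs[0][ext[0]]
    if S > 1:
        alpha[1] = log_probs[0][ext[1]]
    for t in range(1, T):
        new = [NEG_INF] * S
        for s in range(S):
            a = alpha[s]                      # stay on the same symbol
            if s >= 1:
                a = logaddexp(a, alpha[s - 1])  # advance by one
            # skip a blank between two different non-blank labels
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                a = logaddexp(a, alpha[s - 2])
            new[s] = a + log_probs[t][ext[s]]
        alpha = new

    # Valid alignments end in the last label or the trailing blank
    total = alpha[S - 1] if S == 1 else logaddexp(alpha[S - 1], alpha[S - 2])
    return -total
```

For example, with two timesteps, a vocabulary {blank, A} with uniform probability 0.5, and target "A", the three paths (A,A), (blank,A), (A,blank) all collapse to "A", so the loss is -log(0.75).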

Triplet Loss

Triplet loss learns embeddings where similar items are closer together and dissimilar items are farther apart.

  L_triplet = max( d(a, p) - d(a, n) + margin, 0 )

Where:

  • a is the anchor sample
  • p is a positive sample similar to the anchor
  • n is a negative sample dissimilar to the anchor
  • d is a distance function (typically Euclidean or cosine)
  • margin is a hyperparameter enforcing a minimum gap between the positive and negative distances

Applications:

  • Sentence embeddings
  • Document similarity
  • Face recognition
  • Image retrieval
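A minimal sketch using squared Euclidean distance (one common choice of d; `triplet_loss` is illustrative, not a library API):

```python
def triplet_loss(anchor, positive, negative, margin=1.0):
    """Triplet loss with squared Euclidean distance."""
    def dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return max(dist(anchor, positive) - dist(anchor, negative) + margin, 0.0)

# Positive already much closer than negative: margin satisfied, loss is 0
satisfied = triplet_loss([0, 0], [0.1, 0], [3, 0])       # 0.0
# Negative closer than positive: positive loss pushes the embeddings apart
violated = triplet_loss([0, 0], [2, 0], [0.5, 0])        # 4 - 0.25 + 1 = 4.75
```

Training quality depends heavily on how triplets are mined; "hard" negatives (negatives close to the anchor) give the most informative gradients.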

Contrastive Loss

Contrastive loss is used to learn discriminative features by pushing similar samples closer and dissimilar samples further apart.

  L_contrastive = y * d^2 + (1 - y) * max(margin - d, 0)^2

Where:

  • y is 0 for dissimilar pairs and 1 for similar pairs
  • d is the distance between the two samples
  • margin is a hyperparameter: dissimilar pairs closer than the margin are penalized

Applications:

  • Sentence similarity
  • Learning text embeddings
  • Siamese networks for document comparison
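A small sketch of the pairwise form above, using Euclidean distance (`contrastive_loss` is an illustrative helper):

```python
import math

def contrastive_loss(x1, x2, y, margin=1.0):
    """Contrastive loss; y = 1 for similar pairs, 0 for dissimilar."""
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))
    return y * d ** 2 + (1 - y) * max(margin - d, 0.0) ** 2

# Similar pair far apart: penalized by d^2
far_similar = contrastive_loss([0, 0], [2, 0], y=1)      # d = 2, loss = 4.0
# Dissimilar pair closer than the margin: penalized by (margin - d)^2
near_dissimilar = contrastive_loss([0, 0], [0.5, 0], y=0)  # (1 - 0.5)^2 = 0.25
# Dissimilar pair beyond the margin: no loss
far_dissimilar = contrastive_loss([0, 0], [3, 0], y=0)   # 0.0
```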

Reinforcement Learning from Human Feedback (RLHF) Losses

PPO (Proximal Policy Optimization) Loss

  L_PPO = -E_t[ min( r_t(θ) Â_t, clip(r_t(θ), 1 - ε, 1 + ε) Â_t ) ]

Where:

  • r_t(θ) is the ratio of new policy probability to old policy probability
  • Â_t is the advantage estimate
  • ε is a hyperparameter that constrains policy updates
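The clipped surrogate for a single sample can be sketched as follows (negated because optimizers minimize; real RLHF setups add KL and value-function terms not shown here):

```python
def ppo_loss(ratio, advantage, eps=0.2):
    """Clipped PPO surrogate objective for one sample (averaged over a batch in practice).

    ratio: pi_new(a|s) / pi_old(a|s)
    advantage: advantage estimate for the action
    eps: clipping range constraining the policy update
    """
    clipped = max(min(ratio, 1 + eps), 1 - eps)
    return -min(ratio * advantage, clipped * advantage)

# Ratio within the trust region: the unclipped term is used
in_region = ppo_loss(ratio=1.1, advantage=2.0)    # -(1.1 * 2.0) = -2.2
# Ratio too large for a positive advantage: capped at 1 + eps
clipped_case = ppo_loss(ratio=1.5, advantage=2.0)  # -(1.2 * 2.0) = -2.4
```

The min with the clipped term removes the incentive to move the policy far from the old one in a single update.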

Direct Preference Optimization (DPO) Loss

  L_DPO = -E_{(x, y_w, y_l)}[ log σ( β log(π_θ(y_w | x) / π_ref(y_w | x)) - β log(π_θ(y_l | x) / π_ref(y_l | x)) ) ]

Where:

  • π_θ is the policy being trained
  • π_ref is the reference policy
  • x, y_w, y_l are the input, preferred output, and dispreferred output
  • β is a hyperparameter controlling how far the policy may deviate from the reference
  • σ is the logistic (sigmoid) function

Applications:

  • Fine-tuning language models based on human preferences
  • Aligning large language models with human values
  • Improving language model outputs for specific criteria
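The DPO loss for one preference pair can be sketched as follows, assuming the sequence log-probabilities under the trained and reference policies have already been computed (`dpo_loss` is an illustrative helper):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for a single preference pair.

    logp_w / logp_l: policy log-probs of the preferred / dispreferred output
    ref_logp_w / ref_logp_l: the same quantities under the frozen reference policy
    """
    # Implicit reward margin: how much more the policy prefers y_w over y_l
    # than the reference does, scaled by beta
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log sigmoid(margin) == log(1 + exp(-margin))
    return math.log(1 + math.exp(-margin))

# Policy prefers the chosen output more than the reference does: loss below log(2)
good = dpo_loss(logp_w=-10.0, logp_l=-14.0, ref_logp_w=-12.0, ref_logp_l=-12.0)
# Policy prefers the rejected output: loss above log(2)
bad = dpo_loss(logp_w=-14.0, logp_l=-10.0, ref_logp_w=-12.0, ref_logp_l=-12.0)
```

Unlike PPO-based RLHF, no separate reward model or sampling loop is needed; the preference signal is optimized directly through this classification-style loss.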