NLP losses

Loss functions commonly used in NLP tasks. For general losses (Cross-Entropy, MSE, KL Divergence), see General losses. For NLP evaluation metrics, see NLP metrics.

When to use which loss

Loss	When to use
Negative Log-Likelihood (NLL)	Language modeling, sequence prediction.
Perplexity	Evaluation form of NLL: the exponentiated average NLL, i.e. the geometric mean of inverse token probabilities.
CTC	Sequence-to-sequence without explicit alignment (speech, OCR).
Triplet	Metric learning; similar items close, dissimilar far.
Contrastive	Siamese networks; discriminative feature learning.
PPO	Policy-gradient RL with clipped updates for stability.
DPO	Preference-based LLM fine-tuning without a reward model.

Negative Log-Likelihood (NLL) Loss

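Penalizes the model for assigning low probability to the true token or class. It is the standard training objective for language modeling and sequence prediction.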

$$L_{NLL} = -\frac{1}{N} \sum_{i=1}^{N} \log p(y_{i} \mid x_{i})$$

Where $p(y_{i} \mid x_{i})$ is the predicted probability of the true token/class $y_{i}$ given input $x_{i}$, and $N$ is the number of tokens or examples being averaged over.
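As a minimal sketch of the formula, the function below averages the negative log of the probabilities the model assigned to the true tokens. The function name `nll_loss` and the example probabilities are illustrative assumptions, not part of any library.

```python
import math


def nll_loss(true_token_probs):
    """Average negative log-likelihood.

    true_token_probs: probabilities the model assigned to the correct
    token/class at each of the N positions (each in (0, 1]).
    """
    n = len(true_token_probs)
    return -sum(math.log(p) for p in true_token_probs) / n


# Hypothetical probabilities assigned to the true token at four positions.
probs = [0.5, 0.25, 0.8, 0.1]
loss = nll_loss(probs)
print(f"NLL = {loss:.4f}")
```

A confident, correct model (probabilities near 1) drives the loss toward 0, while a single near-zero probability makes it large.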

Applications: Language modeling, machine translation, text generation, sequence prediction.

Perplexity

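Perplexity is the exponential of the average negative log-likelihood. It can be read as the effective number of equally likely tokens the model is choosing among at each step: a model that spreads probability uniformly over $k$ tokens has perplexity $k$. Perplexity is an evaluation metric rather than a training loss, but because the exponential is monotonic, minimizing NLL also minimizes perplexity.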

$$\text{Perplexity} = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log p(y_{i} \mid y_{<i})\right)$$

Where $p (y_{i} ∣ y_{< i})$ is the probability of the $i$ -th token given previous tokens.
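The relationship between perplexity and the average negative log-likelihood can be sketched in a few lines. The function name `perplexity` and the input values are assumptions chosen for illustration.

```python
import math


def perplexity(token_probs):
    """exp of the average negative log-probability of the observed tokens.

    token_probs: p(y_i | y_<i) for each of the N tokens in the sequence.
    """
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)


# A model that always assigns probability 1/4 to the observed token behaves
# like a uniform choice among 4 options, so its perplexity is about 4.
uniform_ppl = perplexity([0.25] * 10)
print(f"perplexity = {uniform_ppl:.3f}")
```

Equivalently, the result is the reciprocal of the geometric mean of the token probabilities, so one very unlikely token raises perplexity more than an arithmetic average would suggest.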

Applications: Language modeling, text generation evaluation, speech recognition.

Connectionist Temporal Classification (CTC) Loss

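Trains sequence models when the input is at least as long as the target and no frame-level alignment is available. Instead of requiring pre-segmented training data, it sums the probability of every alignment that maps to the target.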

$$L_{CTC} = -\log \sum_{\pi \in A^{-1}(y)} \prod_{t=1}^{T} p(\pi_{t} \mid x)$$

Where:

$A^{-1}(y)$ is the set of all alignments (per-time-step label paths, including blanks) that collapse to the target sequence $y$.
$p(\pi_{t} \mid x)$ is the probability of emitting label $\pi_{t}$ at time step $t$ given input $x$.
$T$ is the number of input time steps.
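The sum over alignments can be made concrete with a small sketch. It computes the loss twice: once by enumerating every path (only feasible for tiny inputs) and once with the standard forward dynamic-programming recursion. The blank index `BLANK = 0`, the function names, and the toy probability table are assumptions for illustration; production code would normally work in log space for numerical stability.

```python
import math
from itertools import product

BLANK = 0  # assumed index of the CTC blank symbol


def collapse(path):
    """Map an alignment to labels: merge repeated symbols, then drop blanks."""
    out, prev = [], None
    for s in path:
        if s != prev and s != BLANK:
            out.append(s)
        prev = s
    return out


def ctc_loss_brute_force(probs, target):
    """-log of the total probability of all alignments collapsing to target."""
    T, V = len(probs), len(probs[0])
    total = 0.0
    for path in product(range(V), repeat=T):
        if collapse(path) == list(target):
            total += math.prod(probs[t][s] for t, s in enumerate(path))
    return -math.log(total)


def ctc_loss_forward(probs, target):
    """Same quantity via the forward recursion over the blank-extended target."""
    ext = [BLANK]
    for label in target:
        ext += [label, BLANK]
    S, T = len(ext), len(probs)
    alpha = [0.0] * S
    alpha[0] = probs[0][ext[0]]
    if S > 1:
        alpha[1] = probs[0][ext[1]]
    for t in range(1, T):
        new = [0.0] * S
        for s in range(S):
            a = alpha[s]
            if s >= 1:
                a += alpha[s - 1]
            # Skipping the blank between two labels is only valid if they differ.
            if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                a += alpha[s - 2]
            new[s] = a * probs[t][ext[s]]
        alpha = new
    total = alpha[-1] + (alpha[-2] if S > 1 else 0.0)
    return -math.log(total)


# Toy per-step distributions over {blank, label 1, label 2} for T = 4 steps.
probs = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.2, 0.7],
    [0.5, 0.25, 0.25],
]
print(ctc_loss_brute_force(probs, [1, 2]), ctc_loss_forward(probs, [1, 2]))
```

The two functions should agree up to floating-point error. The brute-force version enumerates $V^{T}$ paths, which is why the forward recursion is the one used in practice.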

Applications: Speech recognition, handwriting recognition, protein sequence alignment.

Triplet Loss

Learns embeddings where similar items are closer together and dissimilar items are farther apart.

$$L_{triplet} = \max(d(a, p) - d(a, n) + \text{margin}, 0)$$

Where:

$a$ is the anchor sample.
$p$ is a positive sample similar to the anchor.
$n$ is a negative sample dissimilar to the anchor.
$d$ is a distance function (typically Euclidean or cosine).
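margin is a hyperparameter: the loss is zero once the negative is at least this much farther from the anchor than the positive.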
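A minimal sketch of the triplet formula for a single (anchor, positive, negative) triple, assuming Euclidean distance for $d$. The helper names and the example vectors are illustrative only.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """max(d(a, p) - d(a, n) + margin, 0) for one triple."""
    return max(euclidean(anchor, positive) - euclidean(anchor, negative) + margin, 0.0)


anchor = [0.0, 0.0]
positive = [0.0, 1.0]        # distance 1 from the anchor
easy_negative = [3.0, 4.0]   # distance 5: already beyond the margin
hard_negative = [0.0, 1.5]   # distance 1.5: violates the margin

print(triplet_loss(anchor, positive, easy_negative))
print(triplet_loss(anchor, positive, hard_negative))
```

Only triples that violate the margin produce a non-zero loss, which is why training pipelines often mine hard negatives rather than sampling them uniformly.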

Applications: Sentence embeddings, document similarity, face recognition, image retrieval.

Contrastive Loss

Learns discriminative features from labeled pairs by pulling similar samples closer together and pushing dissimilar samples apart until they are at least a margin away.

$$L_{contrastive} = (1 - Y) \cdot \frac{1}{2} D^{2} + Y \cdot \frac{1}{2} \max(0, \text{margin} - D)^{2}$$

Where:

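$Y$ is 0 for similar pairs and 1 for dissimilar pairs, so similar pairs are penalized by their squared distance and dissimilar pairs only while they are closer than the margin.
$D$ is the distance between the two samples' embeddings.
margin is a hyperparameter: dissimilar pairs farther apart than the margin contribute no loss.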
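The formula can be sketched for a single pair as follows. Reading the formula as written, the $D^{2}$ term is active when $Y = 0$, so this sketch treats $Y = 0$ as the similar-pair label and $Y = 1$ as the dissimilar-pair label. The function name and default margin are assumptions.

```python
def contrastive_loss(distance, y, margin=1.0):
    """Contrastive loss for one pair.

    distance: D, the distance between the two embeddings.
    y: 0 for a similar pair, 1 for a dissimilar pair (the labeling under
       which the formula pulls similar pairs together).
    """
    similar_term = (1 - y) * 0.5 * distance ** 2
    dissimilar_term = y * 0.5 * max(0.0, margin - distance) ** 2
    return similar_term + dissimilar_term


print(contrastive_loss(0.5, y=0))  # similar pair: penalized for being apart
print(contrastive_loss(0.5, y=1))  # dissimilar pair inside the margin: penalized
print(contrastive_loss(2.0, y=1))  # dissimilar pair beyond the margin: no loss
```

Unlike triplet loss, each pair is scored on its own, so no anchor is needed and positives and negatives can be sampled independently.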

Applications: Sentence similarity, learning text embeddings, Siamese networks for document comparison.

Reinforcement Learning from Human Feedback (RLHF) Losses

PPO (Proximal Policy Optimization) Loss

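The expression below is the clipped surrogate objective, which is maximized. In practice the loss that is minimized is its negative.

$$L_{PPO} = \mathbb{E}_{t}\left[\min\left(r_{t}(\theta) \cdot A_{t},\ \text{clip}(r_{t}(\theta), 1 - \epsilon, 1 + \epsilon) \cdot A_{t}\right)\right]$$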

Where:

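$r_{t}(\theta) = \pi_{\theta}(a_{t} \mid s_{t}) / \pi_{\theta_{old}}(a_{t} \mid s_{t})$ is the ratio of the new policy's probability of action $a_{t}$ in state $s_{t}$ to the old policy's.
$A_{t}$ is the advantage estimate at step $t$.
$\epsilon$ is the clipping range, a hyperparameter that removes the incentive to move $r_{t}(\theta)$ outside $[1 - \epsilon, 1 + \epsilon]$.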
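A minimal sketch of the clipped term, averaged over a batch of steps. It takes log-probabilities under the new and old policies (an assumption about the input format; the ratio is recovered as their exponentiated difference) and returns the surrogate objective, so a gradient-descent loss would be its negative. The function name and the default `eps=0.2` are illustrative choices.

```python
import math


def ppo_clip_objective(new_logps, old_logps, advantages, eps=0.2):
    """Mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A) over the batch."""
    total = 0.0
    for new_lp, old_lp, adv in zip(new_logps, old_logps, advantages):
        ratio = math.exp(new_lp - old_lp)                # r_t(theta)
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)  # clip(r_t, 1-eps, 1+eps)
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


# Three hypothetical steps: the ratio is 1.5, 0.5, and 1.5 respectively.
new_logps = [math.log(1.5), math.log(0.5), math.log(1.5)]
old_logps = [0.0, 0.0, 0.0]
advantages = [1.0, -1.0, -1.0]

objective = ppo_clip_objective(new_logps, old_logps, advantages)
loss = -objective  # what an optimizer that minimizes would use
print(objective, loss)
```

In the first two steps clipping caps the benefit of moving the ratio further in the favorable direction. In the third step the ratio moved in the harmful direction, so the unclipped (more pessimistic) term is kept.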

Direct Preference Optimization (DPO) Loss

$$L_{DPO} = -\mathbb{E}_{(x, y_{w}, y_{l}) \sim D}\left[\log \sigma\left(\beta \log \frac{\pi_{\theta}(y_{w} \mid x)}{\pi_{ref}(y_{w} \mid x)} - \beta \log \frac{\pi_{\theta}(y_{l} \mid x)}{\pi_{ref}(y_{l} \mid x)}\right)\right]$$

Where:

$π_{θ}$ is the policy being trained.
$π_{ref}$ is the reference policy.
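$(x, y_{w}, y_{l})$ are the input, the preferred output, and the dispreferred output, drawn from the preference dataset $D$.
$\sigma$ is the logistic sigmoid function.
$\beta$ is a hyperparameter that controls how strongly the policy is kept close to the reference policy.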
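The DPO formula can be sketched from sequence-level log-probabilities. The sketch assumes each input is already the summed log-probability of a full response under the policy or the reference model. The function names and the default `beta=0.1` are illustrative assumptions.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Mean DPO loss over a batch of preference pairs.

    Each argument is a list of log pi(y | x) values: *_w for the preferred
    response y_w, *_l for the dispreferred response y_l.
    """
    losses = []
    for pw, pl, rw, rl in zip(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l):
        # beta * (log-ratio of preferred - log-ratio of dispreferred)
        margin = beta * ((pw - rw) - (pl - rl))
        losses.append(-log_sigmoid(margin))
    return sum(losses) / len(losses)


# When the policy equals the reference, the margin is 0 and the loss is log 2.
start = dpo_loss([-5.0], [-6.0], [-5.0], [-6.0])
# Raising the preferred response's log-probability lowers the loss.
improved = dpo_loss([-3.0], [-6.0], [-5.0], [-6.0])
print(start, improved)
```

Because only log-probability differences against the reference appear, no separate reward model is needed: the log-ratio itself plays the role of an implicit reward.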

Applications: Fine-tuning language models based on human preferences, aligning large language models with human values, improving language model outputs for specific criteria.

DSWoK — Data Science Well of Knowledge
