Loss functions (also called objective functions or cost functions) are mathematical measures of the error between predicted and actual values. They quantify how well a model is performing and provide the optimization signal for training.

For domain-specific losses, see NLP losses and Computer vision losses. For evaluation metrics, see Metrics and losses.

When to use which loss

Loss                             When to use
Binary Cross-Entropy (BCE)       Binary classification.
Categorical Cross-Entropy (CCE)  Multi-class classification with a single correct class.
Label Smoothing CE               Reduce overconfidence; regularize class probabilities.
MSE                              Regression; penalizes large errors heavily; differentiable.
RMSE                             MSE in the target’s own units.
MAE                              Regression; less sensitive to outliers than MSE.
KL Divergence                    Distribution matching: VAEs, knowledge distillation.
L1 / Lasso                       Sparse weights; feature selection.
L2 / Ridge                       Penalize large weights; stable training.
Elastic Net                      L1 + L2 combined.

Cross-Entropy Loss

Cross-entropy loss (or log loss) measures the performance of a classification model whose output is a probability value between 0 and 1.

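Binary Cross-Entropy (BCE)

Used for two-class problems where each sample has a label of 0 or 1.

$$L_{\text{BCE}} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]$$

Categorical Cross-Entropy (CCE)

Used for multi-class problems where each sample belongs to a single class.

$$L_{\text{CCE}} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \log \hat{y}_{i,c}$$

Where:

  • $y$ is the ground truth label ($y_i \in \{0, 1\}$ for BCE; for CCE, $y_{i,c}$ is 1 if sample $i$ belongs to class $c$ and 0 otherwise).
  • $\hat{y}$ is the predicted probability.
  • $N$ is the number of samples.
  • $C$ is the number of classes.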
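
As a rough illustration of the two cross-entropy variants, the following NumPy sketch computes both losses for a small batch. The function names and the clipping constant `eps` are assumptions made for this example and are not taken from any particular library.

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean BCE over N samples; y_true in {0, 1}, y_prob is P(class 1)."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip to avoid log(0); eps is an illustrative choice.
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    per_sample = y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob)
    return float(-np.mean(per_sample))

def categorical_cross_entropy(y_onehot, y_prob, eps=1e-12):
    """Mean CCE; both arrays have shape (N, C) and rows of y_prob sum to 1."""
    y_onehot = np.asarray(y_onehot, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0)
    # Only the log-probability of the correct class survives the inner sum.
    return float(-np.mean(np.sum(y_onehot * np.log(y_prob), axis=1)))

bce = binary_cross_entropy([1, 0, 1], [0.9, 0.2, 0.6])
cce = categorical_cross_entropy(
    [[1, 0, 0], [0, 1, 0]],
    [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]],
)
print(f"BCE = {bce:.4f}, CCE = {cce:.4f}")
```

In practice, frameworks usually compute these losses from raw logits for numerical stability; the probability-based version here is only meant to mirror the definitions.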

Label Smoothing Cross-Entropy

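Helps prevent overconfidence by replacing the one-hot encoded ground truth with a mixture of the original labels and a uniform distribution over the classes: $y^{\text{LS}}_{c} = (1 - \varepsilon)\, y_c + \varepsilon / C$, where $\varepsilon$ is the smoothing factor and $C$ is the number of classes. Cross-entropy is then computed against these smoothed targets.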
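
The mixing step can be sketched in a few lines of NumPy. This is a minimal sketch, assuming a smoothing factor named `epsilon` (0.1 is a commonly used value, not one given in the text) and one-hot targets of shape (N, C); the function names are hypothetical.

```python
import numpy as np

def smooth_labels(y_onehot, epsilon=0.1):
    """Mix one-hot targets with a uniform distribution over the classes."""
    y_onehot = np.asarray(y_onehot, dtype=float)
    num_classes = y_onehot.shape[-1]
    return (1.0 - epsilon) * y_onehot + epsilon / num_classes

def label_smoothing_cross_entropy(y_onehot, y_prob, epsilon=0.1, eps=1e-12):
    """Cross-entropy computed against the smoothed targets."""
    targets = smooth_labels(y_onehot, epsilon)
    # Clip to avoid log(0); eps is an illustrative choice.
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0)
    return float(-np.mean(np.sum(targets * np.log(y_prob), axis=1)))

smoothed = smooth_labels([[0, 1, 0, 0]], epsilon=0.1)
print(smoothed)  # the correct class keeps 0.925, the others get 0.025 each
```

Because every class keeps a small amount of target mass, the loss can no longer be driven to zero by pushing one predicted probability to 1, which is what discourages overconfident outputs.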

Mean Squared Error (MSE)

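Average of the squared errors between predicted and actual values. For $N$ samples with actual values $y_i$ and predictions $\hat{y}_i$:

$$\text{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$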

Root Mean Squared Error (RMSE)

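Square root of MSE, which puts the error in the same units as the target variable. For $N$ samples with actual values $y_i$ and predictions $\hat{y}_i$:

$$\text{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2}$$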
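
A small NumPy sketch of MSE and RMSE on a toy regression batch may help make the relationship concrete. The function names and the sample values are made up for illustration.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean of the squared differences between targets and predictions."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(diff ** 2))

def rmse(y_true, y_pred):
    """Square root of MSE, expressed in the units of the target."""
    return float(np.sqrt(mse(y_true, y_pred)))

y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 7.0]  # errors: 1, 0, -2, 0

print(mse(y_true, y_pred))   # 1.25 (squared units)
print(rmse(y_true, y_pred))  # about 1.118 (same units as y)
```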

Mean Absolute Error (MAE)

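Uses absolute differences instead of squared ones, making it less sensitive to outliers than MSE. For $N$ samples with actual values $y_i$ and predictions $\hat{y}_i$:

$$\text{MAE} = \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y}_i|$$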
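
The difference in outlier sensitivity can be seen by corrupting a single prediction and comparing how much each loss grows. The sketch below uses made-up numbers and hypothetical helper names.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean of the absolute differences between targets and predictions."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(diff)))

def mse(y_true, y_pred):
    """Mean of the squared differences, for comparison."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(diff ** 2))

y_true = [1.0, 2.0, 3.0, 4.0]
y_pred_clean = [1.5, 2.5, 2.5, 4.5]     # every error has magnitude 0.5
y_pred_outlier = [1.5, 2.5, 2.5, 14.0]  # last error has magnitude 10

# How much each loss grows when one prediction becomes an outlier.
mae_ratio = mae(y_true, y_pred_outlier) / mae(y_true, y_pred_clean)
mse_ratio = mse(y_true, y_pred_outlier) / mse(y_true, y_pred_clean)
print(f"MAE grows {mae_ratio:.2f}x, MSE grows {mse_ratio:.2f}x")
```

With these values the MAE grows by a factor of 5.75 while the MSE grows by a factor of 100.75, because squaring amplifies the single large error.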

Kullback-Leibler Divergence (KL Divergence)

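Measures how one probability distribution diverges from a second, reference probability distribution. For discrete distributions:

$$D_{\text{KL}}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$$

Where:

  • $P$ is the true distribution.
  • $Q$ is the approximated distribution.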

Applications: Variational autoencoders (VAEs), knowledge distillation, distribution matching.

Variants:

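  • Reverse KL Divergence: swaps the order of the two distributions. Because KL divergence is not symmetric, this generally gives a different value and different fitting behavior.
  • Jensen-Shannon Divergence: a symmetrized and smoothed version of KL divergence, computed by averaging the KL divergence of each distribution from their equal mixture.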
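
The asymmetry of KL divergence and the symmetry of the Jensen-Shannon variant can be checked numerically. This is a minimal sketch for discrete distributions; the function names, the clipping constant `eps`, and the example distributions are assumptions for illustration.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """Discrete KL divergence D_KL(p || q), using the natural logarithm."""
    p = np.asarray(p, dtype=float)
    # Clip q to avoid division by zero; eps is an illustrative choice.
    q = np.clip(np.asarray(q, dtype=float), eps, None)
    mask = p > 0  # terms with p(x) = 0 contribute zero by convention
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL to the mixture m = (p + q) / 2."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

p = [0.7, 0.2, 0.1]  # treated as the true distribution
q = [0.4, 0.4, 0.2]  # treated as the approximation

forward = kl_divergence(p, q)  # forward KL
reverse = kl_divergence(q, p)  # reverse KL: arguments swapped
js = js_divergence(p, q)
print(f"forward = {forward:.4f}, reverse = {reverse:.4f}, JS = {js:.4f}")
```

The forward and reverse values differ, while `js_divergence(p, q)` and `js_divergence(q, p)` agree.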

Regularization Losses

These are typically added to the main loss function to prevent overfitting and improve generalization.

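L1 Regularization (Lasso)

Adds the sum of the absolute values of the weights to the main loss:

$$R_{L1}(w) = \lambda \sum_{j} |w_j|$$

where $w_j$ are the model weights and $\lambda \ge 0$ controls the regularization strength.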

Properties:

  • Encourages sparse weights (many weights become exactly zero).
  • Penalty grows only linearly with weight magnitude, so large weights are penalized less heavily than under L2.
  • Can be used for feature selection.
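
One way to see why L1 produces exact zeros is the soft-thresholding step used by proximal-gradient solvers for L1-penalized problems. The sketch below is illustrative only: the text does not name a specific optimizer, and the function names, weights, and threshold are assumptions.

```python
import numpy as np

def l1_penalty(weights, lam):
    """L1 penalty: lam times the sum of absolute weight values."""
    return float(lam * np.sum(np.abs(weights)))

def soft_threshold(weights, threshold):
    """Proximal step for the L1 penalty: shrink toward zero, stop at zero."""
    weights = np.asarray(weights, dtype=float)
    return np.sign(weights) * np.maximum(np.abs(weights) - threshold, 0.0)

w = np.array([0.8, -0.05, 0.02, -1.5])
shrunk = soft_threshold(w, threshold=0.1)

print(l1_penalty(w, lam=0.1))
print(shrunk)  # weights with magnitude below 0.1 become exactly zero
```

The two small weights are set to exactly zero while the large ones are only reduced by the threshold, which is the sparsity behavior that makes L1 usable for feature selection.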

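L2 Regularization (Ridge)

Adds the sum of the squared weights to the main loss:

$$R_{L2}(w) = \lambda \sum_{j} w_j^2$$

where $w_j$ are the model weights and $\lambda \ge 0$ controls the regularization strength.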

Properties:

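  • Penalizes large weights more heavily than small ones, because the penalty grows quadratically.
  • Rarely sets weights to exactly zero.
  • Smooth and strictly convex, so solutions tend to be more stable than with L1 regularization, especially when features are correlated.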
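
The shrinking behavior can be illustrated with a single gradient step on the penalty alone. This is a sketch under assumed values: the function names, learning rate `lr`, and strength `lam` are hypothetical, and the data loss is left out to isolate the effect of the penalty.

```python
import numpy as np

def l2_penalty(weights, lam):
    """L2 penalty: lam times the sum of squared weights."""
    return float(lam * np.sum(np.square(weights)))

def l2_shrink_step(weights, lam, lr):
    """One gradient step on the penalty alone: w <- w - lr * (2 * lam * w)."""
    weights = np.asarray(weights, dtype=float)
    return weights * (1.0 - 2.0 * lr * lam)

w = np.array([0.8, -0.05, 0.02, -1.5])
shrunk = l2_shrink_step(w, lam=0.1, lr=0.5)

print(l2_penalty(w, lam=0.1))
print(shrunk)  # every weight is scaled by 0.9; none becomes exactly zero
```

Each weight is multiplied by the same factor, so large weights lose the most in absolute terms but small weights never reach zero.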

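Elastic Net

Adds a weighted combination of the L1 and L2 penalties to the main loss:

$$R_{\text{EN}}(w) = \lambda_1 \sum_{j} |w_j| + \lambda_2 \sum_{j} w_j^2$$

where $w_j$ are the model weights and $\lambda_1, \lambda_2 \ge 0$ control the strength of each term.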

Properties:

  • Combines L1 and L2 regularization.
  • Can select groups of correlated features.
  • Often more stable than L1 alone when features are highly correlated, while still producing sparser solutions than L2 alone.
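
Putting the pieces together, a regularized objective is the data loss plus the penalty. The following sketch uses hypothetical names (`elastic_net_penalty`, `regularized_loss`, `lam1`, `lam2`) and arbitrary values; it is one plausible way to write the combination, not a reference implementation.

```python
import numpy as np

def elastic_net_penalty(weights, lam1, lam2):
    """Weighted sum of the L1 and L2 penalties on the weights."""
    weights = np.asarray(weights, dtype=float)
    l1_term = lam1 * np.sum(np.abs(weights))
    l2_term = lam2 * np.sum(weights ** 2)
    return float(l1_term + l2_term)

def regularized_loss(data_loss, weights, lam1=0.0, lam2=0.0):
    """Main loss plus the regularization penalty."""
    return data_loss + elastic_net_penalty(weights, lam1, lam2)

w = np.array([0.8, -0.05, 0.02, -1.5])
data_loss = 0.30  # placeholder value standing in for, e.g., an MSE term

print(regularized_loss(data_loss, w, lam1=0.1))            # L1 only
print(regularized_loss(data_loss, w, lam2=0.1))            # L2 only
print(regularized_loss(data_loss, w, lam1=0.1, lam2=0.1))  # Elastic Net
```

Setting `lam2` to zero recovers the L1 penalty and setting `lam1` to zero recovers the L2 penalty, so the same function covers all three regularizers.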