Loss functions (also called objective functions or cost functions) are mathematical measures of the error between predicted and actual values. They quantify how well a model is performing and provide the optimization signal for training.

For domain-specific losses, see NLP losses and Computer vision losses. For evaluation metrics, see Metrics and losses.

Cross-Entropy Loss

Cross-entropy loss (or log loss) measures the performance of a classification model whose output is a probability value between 0 and 1.

Binary Cross-Entropy

Binary cross-entropy is used for two-class problems, where each label is either 0 or 1:

\text{BCE} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]

Categorical Cross-Entropy

Categorical cross-entropy is used for multi-class problems, where each sample belongs to a single class:

\text{CCE} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \log(\hat{y}_{i,c})

Where:

  • y_{i,c} is the ground truth label
  • \hat{y}_{i,c} is the predicted probability
  • N is the number of samples
  • C is the number of classes

Variants:

  • Label Smoothing Cross-Entropy: Helps prevent overconfidence by replacing one-hot encoded ground truth with a mixture of the original labels and a uniform distribution.
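As a sketch, these losses can be written directly in NumPy (the function names and the eps clipping constant are illustrative choices, not from any particular library):

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Average binary cross-entropy; clipping guards against log(0)."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Average categorical cross-entropy; y_true is one-hot with shape (N, C)."""
    y_pred = np.clip(y_pred, eps, 1.0)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

def smooth_labels(y_true, alpha=0.1):
    """Label smoothing: mix one-hot targets with a uniform distribution over C classes."""
    num_classes = y_true.shape[1]
    return (1 - alpha) * y_true + alpha / num_classes
```

With alpha=0.1 and two classes, a one-hot target [1, 0] becomes [0.95, 0.05], so the model is never pushed toward probability exactly 1.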

Mean Squared Error (MSE)

MSE measures the average of the squares of the errors between predicted and actual values:

\text{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2

Variants:

  • Root Mean Squared Error (RMSE): The square root of MSE, providing a measure in the same units as the target variable.
  • Mean Absolute Error (MAE): Uses absolute differences instead of squared differences, making it less sensitive to outliers.
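A minimal NumPy sketch of all three regression losses (illustrative function names, not a library API):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: average of squared differences."""
    return np.mean((y_true - y_pred) ** 2)

def rmse(y_true, y_pred):
    """Root mean squared error: same units as the target variable."""
    return np.sqrt(mse(y_true, y_pred))

def mae(y_true, y_pred):
    """Mean absolute error: less sensitive to outliers than MSE."""
    return np.mean(np.abs(y_true - y_pred))
```

A single large error dominates MSE because it is squared, while MAE weights all errors linearly; this is why MAE is the more robust choice when outliers are expected.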

Kullback-Leibler Divergence (KL Divergence)

KL divergence measures how one probability distribution diverges from a second, expected probability distribution.

D_{\text{KL}}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}

Where:

  • P is the true distribution
  • Q is the approximating distribution

Applications:

  • Variational autoencoders (VAEs)
  • Knowledge distillation
  • Distribution matching

Variants:

  • Reverse KL Divergence: Swaps the order of the distributions, computing D_{\text{KL}}(Q \,\|\, P); this tends to be mode-seeking rather than mass-covering.
  • Jensen-Shannon Divergence: A symmetrized and smoothed version of KL divergence.
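A sketch of KL and Jensen-Shannon divergence in NumPy (a small eps is added for numerical safety; in practice scipy.special.rel_entr or framework-provided losses are commonly used instead):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """D_KL(P || Q) for discrete distributions given as arrays of probabilities."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return np.sum(p * np.log((p + eps) / (q + eps)))

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetrized KL against the mixture M = (P + Q) / 2."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

Note that kl_divergence(p, q) and kl_divergence(q, p) generally differ (the asymmetry that motivates the reverse-KL variant), while js_divergence is symmetric by construction.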

Regularization Losses

These are typically added to the main loss function to prevent overfitting and improve generalization.

L1 Regularization (Lasso)

L1 regularization adds the sum of the absolute values of the weights to the loss:

L_{\text{total}} = L + \lambda \sum_{i} |w_i|

Properties:

  • Encourages sparse weights (many weights become exactly zero)
  • Less sensitive to outliers than L2
  • Can be used for feature selection

L2 Regularization (Ridge)

L2 regularization adds the sum of the squared weights to the loss:

L_{\text{total}} = L + \lambda \sum_{i} w_i^2

Properties:

  • Penalizes large weights more heavily
  • Rarely sets weights to exactly zero
  • More stable solutions than L1 regularization

Elastic Net

Elastic net applies both penalties, each with its own coefficient:

L_{\text{total}} = L + \lambda_1 \sum_{i} |w_i| + \lambda_2 \sum_{i} w_i^2

Properties:

  • Combines L1 and L2 regularization
  • Can select groups of correlated features
  • More robust than either L1 or L2 alone
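The three penalty terms above can be sketched as plain functions of a weight vector (the lambda values below are arbitrary illustrative defaults, not recommended settings):

```python
import numpy as np

def l1_penalty(weights, lam=0.01):
    """L1 (Lasso) penalty: lam * sum of absolute weights."""
    return lam * np.sum(np.abs(weights))

def l2_penalty(weights, lam=0.01):
    """L2 (Ridge) penalty: lam * sum of squared weights."""
    return lam * np.sum(weights ** 2)

def elastic_net_penalty(weights, lam1=0.01, lam2=0.01):
    """Elastic net: a weighted combination of the L1 and L2 penalties."""
    return l1_penalty(weights, lam1) + l2_penalty(weights, lam2)
```

During training, one of these penalties is added to the main loss (e.g. cross-entropy or MSE) before computing gradients, which is what shrinks the weights toward zero.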