Loss functions (also called objective functions or cost functions) are mathematical measures of the error between predicted and actual values. They quantify how well a model is performing and provide the optimization signal for training.
For domain-specific losses, see NLP losses and Computer vision losses. For evaluation metrics, see Metrics and losses.
When to use which loss
| Loss | When to use |
|---|---|
| Binary Cross-Entropy (BCE) | Binary classification. |
| Categorical Cross-Entropy (CCE) | Multi-class classification with a single correct class. |
| Label Smoothing CE | Reduce overconfidence; regularize class probabilities. |
| MSE | Regression; penalizes large errors heavily; differentiable. |
| RMSE | MSE in the target’s own units. |
| MAE | Regression, less sensitive to outliers than MSE. |
| KL Divergence | Distribution matching — VAEs, knowledge distillation. |
| L1 / Lasso | Sparse weights; feature selection. |
| L2 / Ridge | Penalize large weights; stable training. |
| Elastic Net | L1 + L2 combined. |
Cross-Entropy Loss
Cross-entropy loss (or log loss) measures the performance of a classification model whose output is a probability value between 0 and 1.
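Binary Cross-Entropy (BCE)
Used for binary problems where each sample is labeled 0 or 1.
$$L_{\text{BCE}} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]$$
Categorical Cross-Entropy (CCE)
Used for multi-class problems where each sample belongs to a single class.
$$L_{\text{CCE}} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \log(\hat{y}_{i,c})$$
Where:
- $y$ is the ground truth label.
- $\hat{y}$ is the predicted probability.
- $N$ is the number of samples.
- $C$ is the number of classes.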
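The binary and categorical cross-entropy losses described above can be sketched in plain Python. This is a minimal illustration, not a library implementation: the function names and the `eps` clipping constant are assumptions introduced here to avoid taking `log(0)`.

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean BCE over N samples; y_true in {0, 1}, y_pred in (0, 1)."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0); eps is an assumed constant
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean CCE; y_true rows are one-hot, y_pred rows are class probabilities."""
    total = 0.0
    for target_row, prob_row in zip(y_true, y_pred):
        for y, p in zip(target_row, prob_row):
            total += y * math.log(max(p, eps))
    return -total / len(y_true)

# Three binary samples and two 3-class samples (made-up values for illustration).
bce = binary_cross_entropy([1, 0, 1], [0.9, 0.2, 0.6])
cce = categorical_cross_entropy(
    [[1, 0, 0], [0, 1, 0]],
    [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]],
)
```

With one-hot targets, only the predicted probability of the correct class contributes to the categorical loss, so a confident wrong prediction is penalized far more than an uncertain one.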
Label Smoothing Cross-Entropy
Helps prevent overconfidence by replacing one-hot encoded ground truth with a mixture of the original labels and a uniform distribution.
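A small sketch of the label mixing described above, assuming the common parameterization in which a smoothing factor `epsilon` (a name chosen here, not taken from the text) spreads `epsilon` of the probability mass uniformly over all classes:

```python
import math

def smooth_labels(one_hot, epsilon=0.1):
    """Mix a one-hot target with a uniform distribution over its classes."""
    k = len(one_hot)
    return [(1.0 - epsilon) * y + epsilon / k for y in one_hot]

def cross_entropy(target, probs, eps=1e-12):
    """Cross-entropy of one sample against a (possibly soft) target."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, probs))

hard = [0.0, 1.0, 0.0]
soft = smooth_labels(hard, epsilon=0.1)

# A very confident prediction is cheap under the hard target
# but costs noticeably more under the smoothed one.
confident = [0.001, 0.998, 0.001]
loss_hard = cross_entropy(hard, confident)
loss_soft = cross_entropy(soft, confident)
```

Because the smoothed target assigns nonzero mass to every class, pushing any predicted probability toward zero increases the loss, which is what discourages overconfident outputs.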
Mean Squared Error (MSE)
Average of the squares of the errors between predicted and actual values.
Root Mean Squared Error (RMSE)
Square root of MSE — same units as the target variable.
Mean Absolute Error (MAE)
Average of the absolute differences between predicted and actual values. Because errors are not squared, MAE is less sensitive to outliers than MSE.
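The three regression losses above differ mainly in how they treat large errors. The following sketch, with illustrative function names and made-up data, computes each one and then repeats the calculation with a single outlier prediction:

```python
import math

def mse(y_true, y_pred):
    """Mean of squared errors."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Square root of MSE, expressed in the target's own units."""
    return math.sqrt(mse(y_true, y_pred))

def mae(y_true, y_pred):
    """Mean of absolute errors."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)

y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 7.0]           # errors: 1, 0, 2, 0
y_pred_outlier = [2.0, 5.0, 4.0, 17.0]  # last prediction is off by 10

print(mse(y_true, y_pred), mae(y_true, y_pred))                  # 1.25 0.75
print(mse(y_true, y_pred_outlier), mae(y_true, y_pred_outlier))  # 26.25 3.25
```

In this example the single outlier multiplies MSE by 21 but MAE by only about 4.3, which illustrates why MAE is often preferred when the data contain outliers.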
Kullback-Leibler Divergence (KL Divergence)
Measures how one probability distribution diverges from a second, expected probability distribution.
$$D_{\text{KL}}(P \parallel Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$$
Where:
- $P$ is the true distribution.
- $Q$ is the approximated distribution.
Applications: Variational autoencoders (VAEs), knowledge distillation, distribution matching.
Variants:
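- Reverse KL Divergence: swaps the order of the two distributions. Minimizing it tends to be mode-seeking, whereas minimizing forward KL tends to be mass-covering.
- Jensen-Shannon Divergence: symmetrized and smoothed version of KL divergence that compares each distribution with their average.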
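KL divergence and the two variants listed above can be compared on a pair of small discrete distributions. This is a sketch under stated assumptions: the function names are illustrative, the distributions are made up, and the code assumes the second argument is nonzero wherever the first is.

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) for discrete distributions; terms with p_i == 0 contribute 0.

    Assumes q_i > 0 wherever p_i > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL of P and Q to their mixture M."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

p = [0.7, 0.2, 0.1]
q = [0.4, 0.4, 0.2]

forward = kl_divergence(p, q)  # D_KL(P || Q)
reverse = kl_divergence(q, p)  # D_KL(Q || P), generally a different value
js = js_divergence(p, q)       # symmetric in its arguments
```

The forward and reverse values differ because KL divergence is not symmetric, while the Jensen-Shannon value is unchanged when the arguments are swapped and, with natural logarithms, is bounded above by `log(2)`.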
Regularization Losses
These are typically added to the main loss function to prevent overfitting and improve generalization.
L1 Regularization (Lasso)
Properties:
- Encourages sparse weights (many weights become exactly zero).
- Penalty grows linearly with weight magnitude, so large weights are penalized less aggressively than under L2.
- Can be used for feature selection.
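To illustrate why L1 regularization yields exact zeros, the sketch below computes the L1 penalty and applies soft-thresholding, the proximal step commonly paired with it. The strength `lam`, the `threshold` value, and the function names are assumptions introduced for this example, and soft-thresholding is one of several ways to optimize an L1-penalized objective.

```python
import math

def l1_penalty(weights, lam):
    """L1 penalty: lam times the sum of absolute weights."""
    return lam * sum(abs(w) for w in weights)

def soft_threshold(weights, threshold):
    """Shrink each weight toward zero; weights within the threshold become exactly 0."""
    out = []
    for w in weights:
        magnitude = max(abs(w) - threshold, 0.0)
        out.append(0.0 if magnitude == 0.0 else math.copysign(magnitude, w))
    return out

weights = [0.8, -0.05, 0.02, -1.5]
penalty = l1_penalty(weights, lam=0.1)
sparse = soft_threshold(weights, threshold=0.1)
# The two small weights are set exactly to zero; the large ones shrink by 0.1.
```

The weights that survive thresholding identify the features the model still relies on, which is the basis for using L1 as a feature-selection tool.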
L2 Regularization (Ridge)
Properties:
- Penalizes large weights more heavily.
- Rarely sets weights to exactly zero.
- More stable solutions than L1 regularization.
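The properties above follow from the squared form of the penalty. As a rough sketch, with `lam` (penalty strength), `lr` (learning rate), and the function names all chosen here for illustration, one gradient step on the L2 penalty alone scales every weight by the same factor:

```python
def l2_penalty(weights, lam):
    """L2 penalty: lam times the sum of squared weights."""
    return lam * sum(w * w for w in weights)

def l2_shrink_step(weights, lam, lr):
    """One gradient step on the L2 penalty alone: w <- w - lr * 2 * lam * w."""
    return [w - lr * 2.0 * lam * w for w in weights]

weights = [0.8, -0.05, 0.02, -1.5]
penalty = l2_penalty(weights, lam=0.1)
shrunk = l2_shrink_step(weights, lam=0.1, lr=0.5)
# Every weight is multiplied by (1 - 2 * lr * lam) = 0.9; none becomes exactly zero.
```

Large weights lose the most in absolute terms because the gradient is proportional to the weight, yet small weights only approach zero without reaching it.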
Elastic Net
Properties:
- Combines L1 and L2 regularization.
- Can select groups of correlated features.
- More stable than L1 alone when features are correlated, while still producing sparse solutions (unlike L2 alone).
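One way to write the combined penalty is as a weighted mix of the L1 and L2 terms. The parameterization below, with an overall strength `lam` and a mixing weight `l1_ratio`, is an assumption made for this sketch; libraries differ in how they name and scale these terms.

```python
def elastic_net_penalty(weights, lam, l1_ratio):
    """Weighted mix of L1 and L2 penalties; l1_ratio in [0, 1] (assumed convention)."""
    l1 = sum(abs(w) for w in weights)
    l2 = sum(w * w for w in weights)
    return lam * (l1_ratio * l1 + (1.0 - l1_ratio) * l2)

weights = [0.8, -0.05, 0.02, -1.5]
pure_l1 = elastic_net_penalty(weights, lam=0.1, l1_ratio=1.0)  # reduces to the L1 penalty
pure_l2 = elastic_net_penalty(weights, lam=0.1, l1_ratio=0.0)  # reduces to the L2 penalty
mixed = elastic_net_penalty(weights, lam=0.1, l1_ratio=0.5)    # halfway between the two

# The penalty is added to the main loss; data_loss is a placeholder value.
data_loss = 1.25
total_loss = data_loss + mixed
```

Setting `l1_ratio` to 1 or 0 recovers pure L1 or pure L2 regularization, so the mixing weight can be tuned like any other hyperparameter.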