Loss functions (also called objective functions or cost functions) are mathematical measures of the error between predicted and actual values. They quantify how well a model is performing and provide the optimization signal for training.
For domain-specific losses, see NLP losses and Computer vision losses. For evaluation metrics, see Metrics and losses.
Cross-Entropy Loss
Cross-entropy loss (or log loss) measures the performance of a classification model whose output is a probability value between 0 and 1.
Binary Cross-Entropy is used for two-class problems, where each sample has a label $y_i \in \{0, 1\}$:

$$\mathcal{L}_{BCE} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i)\right]$$

Categorical Cross-Entropy is used for multi-class problems, where each sample belongs to a single class:

$$\mathcal{L}_{CE} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c}\log(\hat{y}_{i,c})$$
Where:
- $y_{i,c}$ is the ground truth label
- $\hat{y}_{i,c}$ is the predicted probability
- $N$ is the number of samples
- $C$ is the number of classes
Variants:
- Label Smoothing Cross-Entropy: Helps prevent overconfidence by replacing one-hot encoded ground truth with a mixture of the original labels and a uniform distribution.
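The cross-entropy definitions above can be sketched in NumPy; this is a minimal illustration (the function names and the clipping epsilon are choices for this example, not a standard API):

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean binary cross-entropy over N samples."""
    y_pred = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean categorical cross-entropy; y_true is one-hot with shape (N, C)."""
    y_pred = np.clip(y_pred, eps, 1.0)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

def smooth_labels(y_true, alpha=0.1):
    """Label smoothing: mix one-hot targets with a uniform distribution."""
    n_classes = y_true.shape[1]
    return (1 - alpha) * y_true + alpha / n_classes
```

Note that smoothed targets still sum to 1 per sample, so they remain valid distributions for the categorical loss.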
Mean Squared Error (MSE)
MSE measures the average of the squared errors between predicted and actual values:

$$\text{MSE} = \frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2$$
Variants:
- Root Mean Squared Error (RMSE): The square root of MSE, providing a measure in the same units as the target variable.
- Mean Absolute Error (MAE): Uses absolute differences instead of squared differences, making it less sensitive to outliers.
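The three regression losses above differ only in how the residual $y_i - \hat{y}_i$ is aggregated; a minimal NumPy sketch (function names are illustrative):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean of squared residuals; penalizes large errors quadratically."""
    return np.mean((y_true - y_pred) ** 2)

def rmse(y_true, y_pred):
    """Square root of MSE; same units as the target variable."""
    return np.sqrt(mse(y_true, y_pred))

def mae(y_true, y_pred):
    """Mean of absolute residuals; less sensitive to outliers than MSE."""
    return np.mean(np.abs(y_true - y_pred))
```

A single large residual dominates MSE (it enters squared) but contributes only linearly to MAE, which is why MAE is preferred when the data contain outliers.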
Kullback-Leibler Divergence (KL Divergence)
KL divergence measures how one probability distribution diverges from a second, reference probability distribution:

$$D_{KL}(P \,\|\, Q) = \sum_{x} P(x)\log\frac{P(x)}{Q(x)}$$
Where:
- $P$ is the true distribution
- $Q$ is the approximating distribution
Applications:
- Variational autoencoders (VAEs)
- Knowledge distillation
- Distribution matching
Variants:
- Reverse KL Divergence: Swaps the order of the distributions ($D_{KL}(Q \,\|\, P)$), which tends to be mode-seeking rather than mass-covering.
- Jensen-Shannon Divergence: A symmetrized and smoothed version of KL divergence.
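Both divergences above can be sketched for discrete distributions; this minimal NumPy example (function names are illustrative) uses the convention $0 \log 0 = 0$:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """D_KL(P || Q) for discrete distributions given as probability vectors."""
    p = np.asarray(p, dtype=float)
    q = np.clip(np.asarray(q, dtype=float), eps, 1.0)
    mask = p > 0  # terms with p(x) = 0 contribute nothing
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetrized, smoothed KL via the midpoint M."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

Unlike KL, the Jensen-Shannon divergence is symmetric in its arguments and always finite, which is why it is often preferred for distribution matching.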
Regularization Losses
These are typically added to the main loss function to prevent overfitting and improve generalization.
L1 Regularization (Lasso)
Adds the sum of the absolute values of the weights to the loss:

$$\mathcal{L}_{L1} = \lambda \sum_{i} |w_i|$$
Properties:
- Encourages sparse weights (many weights become exactly zero)
- Less sensitive to outliers than L2
- Can be used for feature selection
L2 Regularization (Ridge)
Adds the sum of the squared weights to the loss:

$$\mathcal{L}_{L2} = \lambda \sum_{i} w_i^2$$
Properties:
- Penalizes large weights more heavily
- Rarely sets weights to exactly zero
- More stable solutions than L1 regularization
Elastic Net
Adds a weighted combination of the L1 and L2 penalties to the loss:

$$\mathcal{L}_{EN} = \lambda_1 \sum_{i} |w_i| + \lambda_2 \sum_{i} w_i^2$$
Properties:
- Combines L1 and L2 regularization
- Can select groups of correlated features
- More robust than either L1 or L2 alone
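The three penalties are simple functions of the weight vector that get added to the main loss; a minimal NumPy sketch (function names and default coefficients are illustrative):

```python
import numpy as np

def l1_penalty(w, lam=0.01):
    """L1 (lasso) penalty: lam * sum of absolute weights."""
    return lam * np.sum(np.abs(w))

def l2_penalty(w, lam=0.01):
    """L2 (ridge) penalty: lam * sum of squared weights."""
    return lam * np.sum(w ** 2)

def elastic_net_penalty(w, lam=0.01, l1_ratio=0.5):
    """Convex mix of the L1 and L2 penalties, controlled by l1_ratio."""
    return l1_ratio * l1_penalty(w, lam) + (1 - l1_ratio) * l2_penalty(w, lam)
```

In training, one of these would be added to the data loss, e.g. `total_loss = mse_loss + l2_penalty(weights)`; the gradient of the penalty then shrinks the weights at every update step.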