General losses

Loss functions (also called objective functions or cost functions) are mathematical measures of the error between predicted and actual values. They quantify how well a model is performing and provide the optimization signal for training.

For domain-specific losses, see NLP losses and Computer vision losses. For evaluation metrics, see Metrics and losses.

When to use which loss

Loss	When to use
Binary Cross-Entropy (BCE)	Binary classification.
Categorical Cross-Entropy (CCE)	Multi-class classification with a single correct class.
Label Smoothing CE	Reduce overconfidence; regularize class probabilities.
MSE	Regression; penalizes large errors heavily; differentiable.
RMSE	MSE in the target’s own units.
MAE	Regression, less sensitive to outliers than MSE.
KL Divergence	Distribution matching — VAEs, knowledge distillation.
L1 / Lasso	Sparse weights; feature selection.
L2 / Ridge	Penalize large weights; stable training.
Elastic Net	L1 + L2 combined.

Cross-Entropy Loss

Cross-entropy loss (or log loss) measures the performance of a classification model whose output is a probability value between 0 and 1.

Binary Cross-Entropy (BCE)

L_{BCE} = - \frac{1}{N} \sum_{i=1}^{N} [y_{i} \log(\hat{y}_{i}) + (1 - y_{i}) \log(1 - \hat{y}_{i})]
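
As a rough illustration, the BCE formula can be evaluated term by term with the Python standard library. The function name `binary_cross_entropy` and the clipping constant `eps` are assumptions made for this sketch, not part of the original notes. In practice a framework implementation would normally be used.

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean BCE over N samples; y_pred holds positive-class probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        # Clip to keep log() away from 0; eps is an illustrative choice.
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)

# Three samples with fairly confident, correct predictions.
loss = binary_cross_entropy([1, 0, 1], [0.9, 0.1, 0.8])
print(round(loss, 4))
```

A confident wrong prediction (for example, label 1 with predicted probability 0.01) drives the loss up sharply, which is the behavior the logarithm is there to produce.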

Categorical Cross-Entropy (CCE)

Used for multi-class problems where each sample belongs to a single class.

L_{CE} = - \frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{i, c} \log(\hat{y}_{i, c})

Where:

$y$ is the ground truth label.
$\hat{y}$ is the predicted probability.
$N$ is the number of samples.
$C$ is the number of classes.
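
The double sum in the categorical case can be sketched the same way. This is a minimal example assuming one-hot rows in `y_true` and rows of class probabilities in `y_pred`. The function name and the `eps` guard are hypothetical choices for illustration.

```python
import math

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean CCE; y_true rows are one-hot, y_pred rows are class probabilities."""
    total = 0.0
    for targets, probs in zip(y_true, y_pred):      # sum over samples i
        for y, p in zip(targets, probs):            # sum over classes c
            total += y * math.log(max(p, eps))      # guard against log(0)
    return -total / len(y_true)

y_true = [[1, 0, 0], [0, 1, 0]]
y_pred = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
loss = categorical_cross_entropy(y_true, y_pred)
print(round(loss, 4))
```

With one-hot labels only the probability assigned to the correct class contributes, so the loss reduces to the mean negative log-probability of the true class.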

Label Smoothing Cross-Entropy

Helps prevent overconfidence by replacing the one-hot ground truth with a mixture of the original labels and a uniform distribution over the $C$ classes, weighted by a smoothing factor $\alpha$ between 0 and 1.

\tilde{y}_{i, c} = (1 - \alpha) \cdot y_{i, c} + \alpha \cdot \frac{1}{C}
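
A small sketch of the smoothing step applied to a single one-hot label. The helper name `smooth_labels` and the default smoothing factor of 0.1 are assumptions for illustration.

```python
def smooth_labels(one_hot, alpha=0.1):
    """Mix a one-hot label with a uniform distribution over its C classes."""
    c = len(one_hot)
    return [(1 - alpha) * y + alpha / c for y in one_hot]

# Four classes, true class at index 1.
smoothed = smooth_labels([0, 1, 0, 0], alpha=0.1)
print(smoothed)
```

The smoothed vector still sums to 1, so it can be passed to the cross-entropy formula in place of the one-hot target.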

Mean Squared Error (MSE)

Average of the squares of the errors between predicted and actual values.

L_{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_{i} - \hat{y}_{i})^{2}

Root Mean Squared Error (RMSE)

Square root of MSE — same units as the target variable.
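
To make the relationship between the two concrete, here is a minimal sketch that computes MSE and then takes its square root to get RMSE. The function names and sample values are illustrative only.

```python
import math

def mse(y_true, y_pred):
    """Mean of squared errors."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Square root of MSE, expressed in the units of the target."""
    return math.sqrt(mse(y_true, y_pred))

y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 8.0]
print(mse(y_true, y_pred))             # squared units of the target
print(round(rmse(y_true, y_pred), 4))  # same units as the target
```

If the target were measured in metres, the MSE here would be in square metres while the RMSE would be back in metres, which is why RMSE is often easier to report.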

Mean Absolute Error (MAE)

Absolute differences instead of squared, making it less sensitive to outliers.

L_{MAE} = \frac{1}{N} \sum_{i=1}^{N} | y_{i} - \hat{y}_{i} |
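
The outlier claim can be checked on a toy example. The sketch below, with made-up data and hypothetical function names, compares how much MAE and MSE grow when a single prediction is badly wrong.

```python
def mae(y_true, y_pred):
    """Mean of absolute errors."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean of squared errors, included for comparison."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

y_true = [1.0, 2.0, 3.0, 4.0]
clean = [1.5, 2.5, 2.5, 3.5]      # every error has magnitude 0.5
outlier = [1.5, 2.5, 2.5, 14.0]   # last error has magnitude 10

# How many times larger each loss becomes once the outlier is present.
mae_growth = mae(y_true, outlier) / mae(y_true, clean)
mse_growth = mse(y_true, outlier) / mse(y_true, clean)
print(mae_growth, mse_growth)
```

On this data the single outlier inflates MSE far more than MAE, because squaring amplifies the one large error.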

Kullback-Leibler Divergence (KL Divergence)

Measures how an approximating probability distribution $Q$ diverges from a reference distribution $P$. It is asymmetric: swapping $P$ and $Q$ generally gives a different value.

L_{KL} = \sum_{i} P(i) \log\left(\frac{P(i)}{Q(i)}\right)

Where:

$P$ is the true distribution.
$Q$ is the approximated distribution.

Applications: Variational autoencoders (VAEs), knowledge distillation, distribution matching.

Variants:

Reverse KL Divergence — swaps the order of the distributions, yielding different behavior.
Jensen-Shannon Divergence — symmetrized and smoothed version of KL divergence.
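
The following sketch computes the forward and reverse KL divergence and the Jensen-Shannon divergence for two small discrete distributions. It assumes both inputs are valid probability vectors and that $Q(i) > 0$ wherever $P(i) > 0$. The function names and example values are illustrative.

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) for discrete distributions; terms with P(i) = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL of P and Q to their midpoint M."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

p = [0.5, 0.4, 0.1]
q = [0.3, 0.3, 0.4]

forward = kl_divergence(p, q)   # KL(P || Q)
reverse = kl_divergence(q, p)   # KL(Q || P), the reverse direction
print(round(forward, 4), round(reverse, 4))
print(round(js_divergence(p, q), 4))
```

The forward and reverse values differ, which illustrates the asymmetry, while the Jensen-Shannon value is the same whichever argument comes first.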

Regularization Losses

These are typically added to the main loss function to prevent overfitting and improve generalization.

L1 Regularization (Lasso)

L_{L1} = \lambda \sum_{i=1}^{n} | w_{i} |

Properties:

Encourages sparse weights (many weights become exactly zero).
Penalty grows linearly with weight magnitude, so individual large weights are penalized less heavily than under L2.
Can be used for feature selection.
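
One way to see where the exact zeros come from is the soft-thresholding operator, the proximal step commonly paired with an L1 penalty. The sketch below is illustrative only. The names `l1_penalty` and `soft_threshold`, the weights, and the threshold are all assumptions, and real optimizers differ in how they apply this step.

```python
def l1_penalty(weights, lam):
    """lambda times the sum of absolute weights."""
    return lam * sum(abs(w) for w in weights)

def soft_threshold(w, t):
    """Shrink w toward zero by t; values within [-t, t] become exactly 0."""
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0

weights = [0.8, -0.05, 0.02, -1.2]
penalty = l1_penalty(weights, lam=0.1)
shrunk = [soft_threshold(w, 0.1) for w in weights]
print(round(penalty, 4))
print(shrunk)
```

The two small weights are set to exactly zero while the large ones are only reduced, which is the sparsity behavior that makes L1 usable for feature selection.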

L2 Regularization (Ridge)

L_{L2} = \lambda \sum_{i=1}^{n} w_{i}^{2}

Properties:

Penalizes large weights more heavily.
Rarely sets weights to exactly zero.
More stable solutions than L1 regularization.
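
For comparison, a gradient step on the L2 penalty alone scales every weight by the same factor instead of clamping small ones to zero. This is a minimal sketch. The function names, learning rate, and weights are hypothetical.

```python
def l2_penalty(weights, lam):
    """lambda times the sum of squared weights."""
    return lam * sum(w * w for w in weights)

def l2_gradient_step(weights, lam, lr):
    """One gradient-descent step on the penalty alone: w <- w - lr * 2 * lam * w."""
    return [w - lr * 2.0 * lam * w for w in weights]

weights = [0.8, -0.05, 0.02, -1.2]
penalty = l2_penalty(weights, lam=0.1)
decayed = l2_gradient_step(weights, lam=0.1, lr=0.5)
print(round(penalty, 5))
print([round(w, 4) for w in decayed])
```

Every weight shrinks proportionally and none reaches zero, consistent with the property that L2 rarely produces exactly sparse solutions.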

Elastic Net

L_{ElasticNet} = \lambda_{1} \sum_{i=1}^{n} | w_{i} | + \lambda_{2} \sum_{i=1}^{n} w_{i}^{2}

Properties:

Combines L1 and L2 regularization.
Can select groups of correlated features.
Tends to be more stable than L1 alone when features are highly correlated, while still producing sparse solutions.
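
The combined penalty is a weighted sum of the two terms, as the sketch below shows. The function name and the coefficient values are illustrative assumptions.

```python
def elastic_net_penalty(weights, lam1, lam2):
    """lam1 * sum(|w|) + lam2 * sum(w^2)."""
    l1 = sum(abs(w) for w in weights)
    l2 = sum(w * w for w in weights)
    return lam1 * l1 + lam2 * l2

weights = [0.8, -0.05, 0.02, -1.2]
penalty = elastic_net_penalty(weights, lam1=0.1, lam2=0.1)
print(round(penalty, 5))
```

Setting `lam2` to 0 recovers the L1 penalty and setting `lam1` to 0 recovers the L2 penalty, so the two coefficients control the balance between sparsity and smooth shrinkage.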
