Regularization refers to a set of techniques used in machine learning to prevent overfitting and improve the generalization of models by discouraging overly complex or flexible models in favor of simpler, more generalizable ones. The most common approach is to add a penalty term to the loss function, but there are many other methods, some of which are model-agnostic (a minimal sketch of the penalty-term idea follows the list below). In general, regularization aims to:
1. Prevent overfitting
2. Improve model generalization
3. Handle multicollinearity in regression problems
4. Perform feature selection (in the case of L1 regularization)
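As a bare-bones illustration of the penalty-term idea, here is a NumPy sketch; the function name, `lam`, and the choice of an L2-style penalty are placeholders, not a reference implementation:

```python
import numpy as np

def penalized_loss(y_true, y_pred, w, lam=0.1):
    """Mean squared error plus a penalty on the weight vector w."""
    mse = np.mean((y_true - y_pred) ** 2)  # data-fit term
    penalty = lam * np.sum(w ** 2)         # complexity penalty (L2-style, as an example)
    return mse + penalty
```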
## Common types of regularization
### L1 Regularization (Lasso)
- Adds the sum of absolute values of the coefficients to the loss function: $\text{Loss} + \lambda \sum_i |w_i|$
- Tends to produce sparse models and thus can be used for feature selection
- Can shrink coefficients to exactly zero (see the sketch below)
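A minimal sketch with scikit-learn's `Lasso` on synthetic data (the `alpha` parameter plays the role of $\lambda$; the values here are arbitrary):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
# Only the first two features actually matter.
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

lasso = Lasso(alpha=0.1).fit(X, y)
print(lasso.coef_)  # most coefficients are driven to exactly 0 -> sparse model
```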
### L2 Regularization (Ridge)
- Adds the sum of squared values of the coefficients to the loss function: $\text{Loss} + \lambda \sum_i w_i^2$
- Shrinks all coefficients toward zero but, unlike L1, rarely makes them exactly zero (see the sketch below)
- Handles multicollinearity well
![[Pasted image 20240715191319.png]]
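A matching sketch with `Ridge`, again on synthetic data with two nearly collinear features; coefficients shrink and the weight is shared across the correlated features rather than exploding:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
X[:, 1] = X[:, 0] + rng.normal(scale=0.01, size=100)  # two nearly collinear features
y = X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=100)

ridge = Ridge(alpha=1.0).fit(X, y)
print(ridge.coef_)  # weight is spread across the correlated features instead of blowing up
```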
### Elastic Net
- Combination of L1 and L2 regularization: $\text{Loss} + \lambda_1 \sum_i |w_i| + \lambda_2 \sum_i w_i^2$
- Balances the benefits of L1 and L2
- Tends to select groups of correlated variables together, unlike Lasso, which may arbitrarily select one of them (see the sketch below)
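A sketch with scikit-learn's `ElasticNet`; `alpha` scales the overall penalty and `l1_ratio` splits it between the L1 and L2 terms (0.5 here is just an example value):

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

# l1_ratio=1.0 would be pure Lasso, l1_ratio=0.0 pure Ridge.
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(enet.coef_)
```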
## Model-specific types of regularization
- L1, L2, Elastic Net in linear models
- Pruning and limiting complexity (e.g. depth) in tree-based models
- C parameter (inverse of regularization strength) in Support Vector Machines (see the sketch after this list)
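Two of these knobs as a rough scikit-learn sketch; the specific values are placeholders, not recommendations:

```python
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, random_state=0)

# Smaller C = stronger regularization (wider margin, more misclassifications tolerated).
svm = SVC(C=0.1).fit(X, y)

# Cost-complexity pruning and a depth cap both limit tree complexity.
tree = DecisionTreeClassifier(max_depth=3, ccp_alpha=0.01).fit(X, y)
```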
## Model-agnostic types of regularization
- Early stopping: stops training when performance on a held-out validation set starts to degrade (see the sketch after this list)
- Data augmentation: artificially increases the size of the training set by generating modified copies of existing samples (e.g. flips, crops, or noise for images)
- Noise injection: adds random noise to inputs or weights during training
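One concrete way to get early stopping without writing the loop yourself is scikit-learn's `MLPClassifier`; shown here as a minimal sketch with arbitrary settings:

```python
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, random_state=0)

# Holds out 10% of the training data as a validation set; stops when the
# validation score fails to improve for n_iter_no_change consecutive epochs.
clf = MLPClassifier(early_stopping=True, validation_fraction=0.1,
                    n_iter_no_change=10, random_state=0).fit(X, y)
print(clf.n_iter_)  # number of epochs actually run before stopping
```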
## Neural-net specific types of regularization
- Dropout: randomly zeroes a fraction of unit activations during training (several of these techniques are combined in the sketch after this list)
- Weight decay (L2): adds a penalty term to the loss function proportional to the sum of squared weights
- Batch normalization: normalizes the inputs to each layer across the batch dimension; originally motivated by reducing internal covariate shift, it also has a mild regularizing effect
- Layer normalization: normalizes the inputs to each layer across the feature dimension
- Label Smoothing: replaces hard labels with soft probabilities
- Gradient Clipping: limits the size of the gradients during backpropagation
- Stochastic Depth: randomly drops entire layers during training
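A minimal PyTorch sketch (assuming a recent PyTorch, ≥1.10 for `label_smoothing`) combining dropout, batch norm, weight decay, label smoothing, and gradient clipping in a single illustrative training step; layer sizes and hyperparameters are arbitrary placeholders:

```python
import torch
import torch.nn as nn

# A small classifier combining several of the techniques above.
model = nn.Sequential(
    nn.Linear(20, 64),
    nn.BatchNorm1d(64),   # batch normalization
    nn.ReLU(),
    nn.Dropout(p=0.5),    # dropout
    nn.Linear(64, 10),
)

# Weight decay (L2) is applied via the optimizer.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
# Label smoothing is built into the loss.
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

x = torch.randn(32, 20)
target = torch.randint(0, 10, (32,))

loss = criterion(model(x), target)
loss.backward()
# Gradient clipping before the optimizer step.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```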
> [!warning] Be careful with using too many types of regularization blindly
> There are cases when different types of regularization have unexpected interactions.
>
> Weight Decay (L2) and Batch Normalization: when used together with BN, an L2 penalty no longer has its original regularizing effect. Instead, it becomes essentially equivalent to an adaptive adjustment of the learning rate. [Source](https://blog.janestreet.com/l2-regularization-and-batch-norm/)
>
> Dropout and Batch Normalization: use BN before Dropout, not after (sketched below). [Link](https://stackoverflow.com/questions/39691902/ordering-of-batch-normalization-and-dropout) to the discussion.
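For the second point, a small sketch of the ordering recommended in that discussion, with placeholder layer sizes:

```python
import torch.nn as nn

block = nn.Sequential(
    nn.Linear(64, 64),
    nn.BatchNorm1d(64),  # BN first, so its statistics are computed on un-dropped activations
    nn.ReLU(),
    nn.Dropout(p=0.3),   # Dropout after BN
)
```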