A Decision Tree is a supervised learning algorithm used for both classification and regression tasks. It creates a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.
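As a quick illustration, here is a minimal scikit-learn sketch (dataset and hyperparameters are arbitrary choices for demonstration):

```python
# Minimal sketch: fit a decision tree classifier and evaluate it.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = DecisionTreeClassifier(max_depth=3, random_state=42)  # depth chosen arbitrarily
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```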
## Structure
* Root Node: The topmost node representing the entire dataset.
* Internal Nodes: Nodes that represent the dataset's features and are used to make decisions.
* Branches: Connections between nodes, representing decision rules.
* Leaf Nodes: Terminal nodes that provide the final prediction: the majority class of the training instances in the leaf (classification) or their mean value (regression).
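One way to make this structure concrete (an illustrative sketch; the field names are my own, not tied to any library):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    feature: Optional[int] = None      # index of the feature used for the split (internal nodes)
    threshold: Optional[float] = None  # split point for that feature (internal nodes)
    left: Optional["Node"] = None      # branch taken when the feature value is below the threshold
    right: Optional["Node"] = None     # branch taken otherwise
    value: Optional[float] = None      # prediction stored in a leaf (class label or mean target)

    def is_leaf(self) -> bool:
        return self.left is None and self.right is None
```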
## Training Process
1. Feature Selection: Choose the best attribute to split the data.
2. Split Point Decision: Determine the best split point for the chosen feature.
3. Splitting: Divide the dataset based on the chosen feature and split point: the left child node receives the data points whose feature value is less than the split point, and the right child node receives the remaining points.
4. Recursion: Repeat steps 1-3 for each child node until stopping criteria are met.
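A from-scratch sketch of steps 1-4 for a classifier, using the Gini impurity defined in the next subsection (illustrative only; the function names, the exhaustive threshold scan, and the simple depth-based stopping rule are my own simplifications):

```python
import numpy as np

def gini(y):
    """Gini impurity of a label array: 1 - sum_k p_k^2."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Steps 1-2: scan every feature and candidate threshold, keep the split
    with the lowest weighted child impurity."""
    best = (None, None, np.inf)  # (feature index, threshold, weighted impurity)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left, right = y[X[:, j] < t], y[X[:, j] >= t]
            if len(left) == 0 or len(right) == 0:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best[2]:
                best = (j, t, score)
    return best

def build_tree(X, y, depth=0, max_depth=3):
    """Steps 3-4: split the data and recurse until a stopping criterion is met.
    Assumes integer class labels in y."""
    if depth == max_depth or len(np.unique(y)) == 1:
        return {"leaf": np.bincount(y).argmax()}  # leaf predicts the majority class
    feature, threshold, _ = best_split(X, y)
    if feature is None:
        return {"leaf": np.bincount(y).argmax()}
    mask = X[:, feature] < threshold
    return {
        "feature": feature,
        "threshold": threshold,
        "left": build_tree(X[mask], y[mask], depth + 1, max_depth),
        "right": build_tree(X[~mask], y[~mask], depth + 1, max_depth),
    }
```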
### Splitting Criteria
* Classification Trees:
  * Gini Impurity: $G = 1 - \sum\limits_k (p_k)^2$
  * Entropy: $S = -\sum\limits_{i=1}^{N}p_i \log_2{p_i}$
* Regression Trees:
  * Variance (equivalent to MSE): $D = \frac{1}{\ell} \sum\limits_{i=1}^{\ell} \left(y_i - \frac{1}{\ell} \sum\limits_{j=1}^{\ell} y_j\right)^2$
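These criteria are easy to compute directly (a small NumPy sketch; the toy label arrays are made up for illustration):

```python
import numpy as np

def gini(y):
    p = np.unique(y, return_counts=True)[1] / len(y)
    return 1.0 - np.sum(p ** 2)            # G = 1 - sum_k p_k^2

def entropy(y):
    p = np.unique(y, return_counts=True)[1] / len(y)
    return -np.sum(p * np.log2(p))         # S = -sum_k p_k log2(p_k)

def variance(y):
    return np.mean((y - np.mean(y)) ** 2)  # mean squared deviation from the node mean

print(gini(np.array([0, 0, 1, 1])))         # 0.5 (maximally mixed binary node)
print(entropy(np.array([0, 0, 1, 1])))      # 1.0 bit
print(variance(np.array([1.0, 2.0, 3.0])))  # ~0.667
```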
### Stopping Criteria
* Maximum depth reached
* Minimum number of samples in a node
* Minimum decrease in impurity
* All samples in a node belong to the same class
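In scikit-learn, for example, these criteria correspond to constructor parameters (the values below are arbitrary illustrations):

```python
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(
    max_depth=5,                 # maximum depth of the tree
    min_samples_split=10,        # minimum number of samples required to split a node
    min_samples_leaf=5,          # minimum number of samples allowed in a leaf
    min_impurity_decrease=1e-3,  # minimum decrease in impurity required for a split
)
```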
## Ensemble Methods Using Decision Trees
* [[Random Forest]]
* [[Gradient boosting]]
## Advantages
* Easy to understand and interpret
* Requires little data preprocessing
* Can handle both numerical and categorical data
* Can be visualized easily
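For instance, scikit-learn can render a fitted tree as readable rules (a minimal sketch on the iris data, with arbitrary settings):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# Print the learned decision rules as if/else text.
print(export_text(clf, feature_names=list(iris.feature_names)))
```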
## Disadvantages
- Can create overly complex trees that do not generalize well (overfitting)
- Can be unstable; small variations in the data can result in a completely different tree
- Biased towards features with more levels (in categorical variables)
- Cannot predict beyond the range of the training data (for regression tasks)
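The last point can be seen directly: a regression tree predicts the constant stored in a leaf, so any input beyond the training range falls into a boundary leaf and gets that leaf's value (a small sketch with made-up data):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.arange(0, 10, 0.5).reshape(-1, 1)  # training inputs in [0, 10)
y = 2 * X.ravel()                         # simple linear target, max value 19

reg = DecisionTreeRegressor(max_depth=4).fit(X, y)
# Both inputs route to the right-most leaf, so both predictions are identical (roughly 19);
# the tree cannot extrapolate the linear trend.
print(reg.predict([[9.5], [100.0]]))
```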
## [[Regularization]]
* Pruning: Removing branches that provide little predictive power
* Setting a minimum number of samples required at a leaf node
* Setting a maximum depth of the tree
* Setting a maximum number of features to consider for splitting
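In scikit-learn these controls are exposed as hyperparameters, and pruning is available as minimal cost-complexity (post-)pruning via `ccp_alpha` (a sketch on the iris data; the chosen alpha and other values are arbitrary illustrations):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Cost-complexity pruning: compute the candidate alphas, then refit with a chosen one.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
print(path.ccp_alphas)  # effective alphas of the pruning sequence

pruned = DecisionTreeClassifier(
    ccp_alpha=path.ccp_alphas[-2],  # fairly aggressive pruning level, for illustration
    min_samples_leaf=5,             # minimum samples required at a leaf node
    max_depth=6,                    # maximum depth of the tree
    max_features="sqrt",            # maximum number of features considered per split
    random_state=0,
).fit(X, y)
print(pruned.get_n_leaves())
```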
## Feature Importance
Decision Trees provide a built-in measure of feature importance: a feature's importance is the total reduction in (sample-weighted) impurity contributed by the splits that use it.
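In scikit-learn, for example, a fitted tree exposes these as the `feature_importances_` attribute, normalized to sum to 1 (a quick sketch on the iris data):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

# Impurity-based importances: total weighted impurity decrease per feature, normalized.
for name, importance in zip(iris.feature_names, clf.feature_importances_):
    print(f"{name}: {importance:.3f}")
```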
## Example of Entropy Calculation
20 balls: 9 blue, 11 yellow.
$S_0 = -\frac{9}{20}\log_2{\frac{9}{20}}-\frac{11}{20}\log_2{\frac{11}{20}} \approx 1$
Now take 13 of the balls, 8 blue and 5 yellow: $S_1 = -\frac{5}{13}\log_2{\frac{5}{13}}-\frac{8}{13}\log_2{\frac{8}{13}} \approx 0.96$
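These numbers are easy to check in code (a short sketch reproducing the two calculations above):

```python
import numpy as np

def entropy(counts):
    p = np.array(counts) / np.sum(counts)
    return -np.sum(p * np.log2(p))

print(entropy([9, 11]))  # S_0 ≈ 0.993 (close to 1: the node is almost maximally mixed)
print(entropy([8, 5]))   # S_1 ≈ 0.961
```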
## Links
* [MLU-Explain: Decision Trees](https://mlu-explain.github.io/decision-tree/)
* [A visual introduction to machine learning: Decision Tree example](http://www.r2d3.us/visual-intro-to-machine-learning-part-1/)