A Decision Tree is a supervised learning algorithm used for both classification and regression tasks. It creates a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.
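
As a quick start, here is a minimal sketch using scikit-learn's `DecisionTreeClassifier` on the Iris toy dataset (the dataset and hyperparameter values are illustrative choices, not prescribed by anything above):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# A small toy dataset: 150 samples, 4 numeric features, 3 classes
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit a tree that learns simple axis-aligned decision rules from the features
clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X_train, y_train)

print("test accuracy:", clf.score(X_test, y_test))
```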

Structure

  • Root Node: The topmost node representing the entire dataset.
  • Internal Nodes: Nodes that test a single feature (typically against a threshold) to decide which branch a sample follows.
  • Branches: Connections between nodes, representing decision rules.
  • Leaf Nodes: Terminal nodes that provide the final output: a class label (the mode of the training instances in the leaf) for classification, or a value (their mean) for regression.
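
This structure maps onto a small recursive data type. A minimal sketch (the `Node` class and its fields are illustrative, not from the text above):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    # Internal nodes: which feature to test and the threshold to compare against
    feature: Optional[int] = None
    threshold: Optional[float] = None
    left: Optional["Node"] = None    # branch taken when feature value <= threshold
    right: Optional["Node"] = None   # branch taken when feature value > threshold
    # Leaf nodes: the final output (majority class or mean target value)
    value: Optional[float] = None

    def is_leaf(self) -> bool:
        return self.value is not None
```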

Training Process

  1. Feature Selection: Choose the best attribute to split the data.
  2. Split Point Decision: Determine the best split point for the chosen feature.
  3. Splitting: Divide the dataset based on the chosen feature and split point: the left child node receives the data points whose feature value is less than or equal to the split point, and the right child node receives those whose value is greater.
  4. Recursion: Repeat steps 1-3 for each child node until stopping criteria are met.
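
Sketched in code, steps 1-4 look roughly like this: a simplified CART-style classifier using Gini impurity, with integer class labels assumed and helper names (`best_split`, `build_tree`) chosen for illustration:

```python
import numpy as np

def gini(y):
    # Gini impurity of a label array: 1 - sum of squared class proportions
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    # Steps 1 + 2: scan every feature and candidate threshold,
    # keeping the split with the lowest weighted child impurity
    best = (None, None, np.inf)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left = X[:, j] <= t
            if left.all() or not left.any():
                continue  # degenerate split, all points on one side
            score = (left.mean() * gini(y[left])
                     + (1 - left.mean()) * gini(y[~left]))
            if score < best[2]:
                best = (j, t, score)
    return best

def build_tree(X, y, depth=0, max_depth=3):
    # Step 4: stop when the node is pure or max depth is reached,
    # otherwise split (step 3) and recurse into both children.
    # Assumes y holds non-negative integer class labels.
    if depth == max_depth or len(np.unique(y)) == 1:
        return {"leaf": np.bincount(y).argmax()}  # majority class
    j, t, _ = best_split(X, y)
    if j is None:
        return {"leaf": np.bincount(y).argmax()}
    left = X[:, j] <= t
    return {"feature": j, "threshold": t,
            "left": build_tree(X[left], y[left], depth + 1, max_depth),
            "right": build_tree(X[~left], y[~left], depth + 1, max_depth)}
```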

Splitting Criteria

  • Classification Trees:
    • Gini Impurity: $G = 1 - \sum_{k} p_k^2$, where $p_k$ is the proportion of class $k$ in the node
    • Entropy: $H = -\sum_{k} p_k \log_2 p_k$
  • Regression Trees:
    • Variance (equivalent to MSE): $\mathrm{Var} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})^2$, where $\bar{y}$ is the mean target value in the node
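
For concreteness, the three criteria in NumPy (function names are illustrative; non-negative integer class labels assumed):

```python
import numpy as np

def gini_impurity(y):
    # 1 - sum(p_k^2) over class proportions p_k
    p = np.bincount(y) / len(y)
    return 1.0 - np.sum(p ** 2)

def entropy(y):
    # -sum(p_k * log2(p_k)), skipping empty classes to avoid log(0)
    p = np.bincount(y) / len(y)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def variance(y):
    # Mean squared deviation from the node mean (MSE of predicting the mean)
    return np.mean((y - y.mean()) ** 2)
```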

Stopping Criteria

  • Maximum depth reached
  • Minimum number of samples in a node
  • Minimum decrease in impurity
  • All samples in a node belong to the same class
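
In code, these criteria usually appear as a guard at the top of the recursive build; the impurity-decrease check uses the best split found for the node. A sketch (parameter names follow scikit-learn's conventions, the function itself is hypothetical):

```python
import numpy as np

def should_stop(y, depth, best_impurity_decrease, *,
                max_depth=5, min_samples_split=2, min_impurity_decrease=0.0):
    # Stop growing this node if any criterion fires
    return (depth >= max_depth                                 # maximum depth reached
            or len(y) < min_samples_split                      # too few samples in the node
            or best_impurity_decrease < min_impurity_decrease  # split barely helps
            or len(np.unique(y)) == 1)                         # all samples share one class
```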

Ensemble Methods Using Decision Trees

Decision trees are the standard base learner for ensembles: bagging methods such as Random Forests train many trees on bootstrap samples and average their predictions, while boosting methods such as Gradient Boosting fit shallow trees sequentially, each one correcting the errors of those before it. Both reduce the variance and overfitting problems of a single tree.
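
A minimal sketch of both flavors in scikit-learn (dataset choice and hyperparameters are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Bagging: many trees on bootstrap samples, predictions averaged
rf = RandomForestClassifier(n_estimators=100, random_state=0)

# Boosting: many shallow trees fit sequentially on residual errors
gb = GradientBoostingClassifier(n_estimators=100, max_depth=2, random_state=0)

for name, model in [("random forest", rf), ("gradient boosting", gb)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(name, scores.mean())
```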

Advantages

  • Easy to understand and interpret
  • Requires little data preprocessing
  • Can handle both numerical and categorical data
  • Can be visualized easily

Disadvantages

  • Can create overly complex trees that do not generalize well (overfitting)
  • Can be unstable; small variations in the data can result in a completely different tree
  • Biased towards features with more levels (in categorical variables)
  • Cannot predict beyond the range of the training data (for regression tasks)
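
The last point is easy to demonstrate: outside the training range a regression tree keeps returning the value of the nearest leaf, a constant. A small sketch on synthetic data:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Train on a linear trend over x in [0, 10]
X = np.linspace(0, 10, 100).reshape(-1, 1)
y = 2 * X.ravel() + 1

reg = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X, y)

# Inside the training range the fit is fine (~11 at x=5); outside it the
# tree returns the rightmost leaf's value (~21 at x=20, not the trend's ~41)
print(reg.predict([[5.0], [20.0]]))
```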

Regularization

  • Pruning: Removing branches that provide little predictive power
  • Setting a minimum number of samples required at a leaf node
  • Setting a maximum depth of the tree
  • Setting a maximum number of features to consider for splitting
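
In scikit-learn these controls map directly onto constructor parameters; pruning is exposed as cost-complexity pruning via `ccp_alpha`. The values below are illustrative:

```python
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(
    max_depth=5,          # cap the depth of the tree
    min_samples_leaf=10,  # require at least 10 samples at each leaf
    max_features="sqrt",  # consider a random subset of features per split
    ccp_alpha=0.01,       # cost-complexity pruning strength
    random_state=0,
)
```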

Feature Importance

Decision Trees provide built-in feature importance: it is calculated from how much each feature decreases the weighted impurity, accumulated over all splits where that feature is used.
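
In scikit-learn this is exposed as the fitted model's `feature_importances_` attribute (dataset is illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
clf = DecisionTreeClassifier(random_state=0).fit(data.data, data.target)

# Importances are normalized impurity decreases; they sum to 1
for name, imp in zip(data.feature_names, clf.feature_importances_):
    print(f"{name}: {imp:.3f}")
```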

Example of entropy calculation:

20 balls: 9 blue, 11 yellow:

$H_0 = -\frac{9}{20}\log_2\frac{9}{20} - \frac{11}{20}\log_2\frac{11}{20} \approx 0.993$

Now 13 balls: 8 blue and 5 yellow:

$H_1 = -\frac{8}{13}\log_2\frac{8}{13} - \frac{5}{13}\log_2\frac{5}{13} \approx 0.961$

The subset is closer to pure (dominated by one class), so its entropy is lower.
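
A quick check of these numbers (the `entropy` helper, taking class counts, is illustrative):

```python
from math import log2

def entropy(counts):
    # Entropy from class counts: -sum(p * log2(p)) over non-empty classes
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

print(entropy([9, 11]))  # ~0.993
print(entropy([8, 5]))   # ~0.961
```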