Qin Xuqiang

Decision Trees (DT) :evergreen_tree:

This post begins a series in which I write about the fundamentals of machine learning as I understand them. The main purpose of this series is to help me learn, but hopefully someone else can get something out of it as well.

Basics

A decision tree is a supervised learning algorithm. It can be used for both classification and regression. As the name suggests, *it uses a (usually binary) tree structure, where at each node a decision is made, usually based on some feature of the dataset.*

Take the Titanic problem, for example, where we are supposed to predict whether a passenger survived (the label) based on their information (the features). Below is a simple decision tree:

                ┌────────────────────┐
                │ First class?       │
                └────────────────────┘
                    │         │
                Yes │         │ No
                    ▼         ▼
         ┌────────────────────┐   ┌───────────┐
         │ Age < 30?          │   │ Dead      │
         └────────────────────┘   └───────────┘
              │        │
          Yes │        │ No
              ▼        ▼
         ┌────────┐ ┌────────┐
         │Survived│ │  Dead  │
         └────────┘ └────────┘

We can see that both numerical (age) and categorical (passenger class) features are used in the tree.
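To make this concrete, here is a minimal sketch of training such a classifier, assuming scikit-learn is available. The tiny passenger dataset and its encoding are made up for illustration; it is not the real Titanic data.

```python
# Minimal sketch with scikit-learn; the toy data below is made up for illustration.
from sklearn.tree import DecisionTreeClassifier

# Features per passenger: [first_class (1/0), age]
X = [
    [1, 25],
    [1, 45],
    [0, 22],
    [0, 35],
    [1, 29],
    [0, 60],
]
y = [1, 0, 0, 0, 1, 0]  # 1 = survived, 0 = did not survive

# Limiting the depth keeps the tree small and readable, like the diagram above.
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(X, y)

print(clf.predict([[1, 20]]))  # e.g. a 20-year-old first-class passenger
```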

Structurally, a decision tree is a flowchart-like structure:

- Each internal node tests a feature (e.g., "Age < 30?").
- Each branch corresponds to an outcome of that test.
- Each leaf node holds a prediction (a class label or a value).

The learning happens at each node, where the algorithm chooses which feature to split on and which split point (partition) of that feature to use.
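To illustrate what "choosing a feature and a split" can look like, here is a toy sketch (not how any particular library implements it) that scans every feature and threshold and keeps the split with the lowest weighted Gini impurity. It reuses the made-up passenger data from above.

```python
# Toy sketch: pick the split that minimizes weighted Gini impurity.
# Efficiency and categorical handling are ignored on purpose.

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(X, y):
    """Return the (feature index, threshold) with the lowest weighted Gini impurity."""
    best_feature, best_threshold, best_score = None, None, float("inf")
    for j in range(len(X[0])):                       # every feature
        for threshold in sorted({row[j] for row in X}):  # every candidate split point
            left = [label for row, label in zip(X, y) if row[j] < threshold]
            right = [label for row, label in zip(X, y) if row[j] >= threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best_score:
                best_feature, best_threshold, best_score = j, threshold, score
    return best_feature, best_threshold

# Same toy data as above: [first_class, age] -> survived?
X = [[1, 25], [1, 45], [0, 22], [0, 35], [1, 29], [0, 60]]
y = [1, 0, 0, 0, 1, 0]
print(best_split(X, y))  # which feature/threshold gives the purest children
```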

Pros of DT:

- Easy to interpret and visualize: you can follow the path from the root to a leaf.
- Handles both numerical and categorical features.
- Requires little data preprocessing (no feature scaling needed).

Cons:

- A single tree can easily overfit, especially when grown deep.
- Small changes in the data can produce a very different tree (high variance).
- Splits are chosen greedily, so the tree is not guaranteed to be globally optimal.

As we mentioned above, a single decision tree is prone to overfitting. To remedy this, we can use an ensemble of decision trees. Below are two popular approaches:

Random Forests

A random forest is a collection of decision trees, where each tree is trained on a random subset of the data and of the features. The final prediction is made by aggregating the predictions of all the trees (for classification: majority vote; for regression: average).

Key points:

- Each tree is trained on a bootstrap sample of the rows and considers only a random subset of the features.
- Predictions are aggregated across trees: majority vote for classification, average for regression.
- The randomness makes the trees less correlated, which reduces variance and overfitting compared with a single tree.
- The trees are independent of each other, so they can be trained in parallel.
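As a concrete sketch, assuming scikit-learn's RandomForestClassifier is available and reusing the made-up passenger data from the earlier example:

```python
# Minimal random forest sketch with scikit-learn; toy data as before.
from sklearn.ensemble import RandomForestClassifier

X = [[1, 25], [1, 45], [0, 22], [0, 35], [1, 29], [0, 60]]
y = [1, 0, 0, 0, 1, 0]

# n_estimators = number of trees; max_features controls the random feature subset;
# bootstrap=True resamples the rows for each tree.
forest = RandomForestClassifier(
    n_estimators=100, max_features="sqrt", bootstrap=True, random_state=0
)
forest.fit(X, y)

print(forest.predict([[1, 20]]))  # majority vote over all 100 trees
```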

Gradient Boosting

Gradient boosting starts with a weak model, usually a small DT, measures its errors, then trains a new model to predict those errors and adds it to the ensemble. This sequential process is repeated many times, where each new model learns from the mistakes of the previous ones.

Step 1: Start with an initial prediction (e.g., average)
    ┌──────────────┐
    │   Model 0    │
    │  (baseline)  │
    └──────────────┘
           │
           ▼
     Residuals/Error

Step 2: Fit a new tree to residuals
    ┌──────────────┐
    │   Model 1    │
    │  (residuals) │
    └──────────────┘
           │
           ▼
     Updated Predictions

Step 3: Fit another tree to new residuals
    ┌──────────────┐
    │   Model 2    │
    │  (new error) │
    └──────────────┘
           │
           ▼
     More Accurate Predictions

Repeat this process for N trees:
    Final Prediction = 
    Model₀ + η·Model₁ + η·Model₂ + ... + η·Modelₙ

Where η = learning rate (a small number, like 0.1)
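Here is a toy from-scratch sketch of that loop for a regression problem, using shallow scikit-learn trees as the weak learners. The data and hyperparameters are made up for illustration.

```python
# Toy boosting loop: start from the mean, then repeatedly fit a shallow tree
# to the residuals and add its scaled predictions to the ensemble.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1, 5.9])

eta = 0.1                                   # learning rate η
n_trees = 50
prediction = np.full(len(y), y.mean())      # Model 0: the baseline prediction
trees = []

for _ in range(n_trees):
    residuals = y - prediction                  # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=1)   # a weak learner (a stump)
    tree.fit(X, residuals)
    prediction += eta * tree.predict(X)         # add Modelₖ scaled by η
    trees.append(tree)

def predict(X_new):
    """Final prediction = baseline + η · sum of each tree's prediction."""
    out = np.full(len(X_new), y.mean())
    for tree in trees:
        out += eta * tree.predict(X_new)
    return out

print(predict(np.array([[2.5]])))
```

In practice you would reach for a tuned library implementation rather than this hand-rolled loop, but the loop above is the core idea.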

Key points:

- Models are added sequentially: each new tree is fit to the residuals (errors) of the current ensemble.
- The learning rate η shrinks each tree's contribution; a smaller η is usually paired with more trees.
- Because later trees keep correcting earlier mistakes, boosting can overfit if too many trees are added or the individual trees are too deep.