Decision Tree

Using a Decision Tree

A decision tree is a machine learning algorithm used for classification and regression tasks. A decision tree looks like this:

Let's say we want to predict whether a person is diabetic using a decision tree. The person has the following features:

Overweight | Family Diabetic | Age > 45
No         | Yes             | No

The person is not overweight, so we go to the right node. Next, the family is diabetic, so we go to the left node. We have reached a leaf node, which tells us that the person is diabetic.
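To make this concrete, here is a minimal sketch (not the article's code) of the lookup the tree performs, assuming the tree in the figure splits on Overweight first and then on Family Diabetic, as in the walkthrough above:

```python
def predict(person):
    """Walk the assumed tree: first split on Overweight, then on Family Diabetic.
    (Age > 45 is not consulted in this assumed tree.)"""
    if person["overweight"]:        # left branch: overweight -> diabetic leaf
        return "Diabetic"
    if person["family_diabetic"]:   # right branch, then left: family diabetic -> diabetic leaf
        return "Diabetic"
    return "Not diabetic"

# The person from the example: not overweight, family is diabetic, age <= 45.
person = {"overweight": False, "family_diabetic": True, "age_over_45": False}
print(predict(person))  # -> Diabetic
```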

Training a Decision Tree

How do we construct the decision tree above in the first place? We feed data to the decision tree algorithm, and it learns and constructs the tree from the given data.

The learning steps are (a rough code sketch follows the list):
STEP 1: Randomly select a feature and perform the split
STEP 2: Compute the Gini Impurity of the split
STEP 3: Compute the Information Gain
STEP 4: Repeat for all features
STEP 5: Select the split with the highest Information Gain
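As a rough sketch of these five steps (not the article's code), assuming each row is a dict of "Yes"/"No" feature values plus a "Diabetic" label, and that candidate splits are simply evaluated for every feature:

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_i^2)."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def information_gain(parent, left, right):
    """Drop in weighted Gini impurity after splitting `parent` into `left` and `right`."""
    n = len(parent)
    return gini(parent) - len(left) / n * gini(left) - len(right) / n * gini(right)

def best_split(rows, features, target="Diabetic"):
    """STEPs 1-5: try a split on every feature, score it, keep the best one."""
    parent = [row[target] for row in rows]
    best_feature, best_ig = None, float("-inf")
    for feature in features:                                    # STEP 1 and STEP 4
        left = [r[target] for r in rows if r[feature] == "Yes"]
        right = [r[target] for r in rows if r[feature] == "No"]
        if not left or not right:                               # split puts everything on one side
            continue
        ig = information_gain(parent, left, right)              # STEP 2 and STEP 3
        if ig > best_ig:                                        # STEP 5
            best_feature, best_ig = feature, ig
    return best_feature, best_ig
```

A concrete run of this calculation on the dataset appears under STEP 5 below.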

We have the dataset below

Row | Overweight | Family Diabetic | Age > 45 | Diabetic
1   | Yes        | No              | No       | Yes
2   | Yes        | Yes             | Yes      | Yes
3   | No         | Yes             | No       | Yes
4   | No         | Yes             | Yes      | Yes
5   | No         | Yes             | Yes      | Yes
6   | Yes        | No              | Yes      | No
7   | No         | No              | No       | No
8   | No         | No              | No       | No
9   | Yes        | No              | Yes      | No

STEP 1

The algorithm randomly selects a feature (in this case Overweight) and splits the data into two groups (nodes), depending on whether the person is overweight or not.

Rows 1, 2, 6, and 9 go to the left node, while rows 3, 4, 5, 7, and 8 go to the right node.

Left node (Overweight = Yes):

Row | Overweight | Family Diabetic | Age > 45 | Diabetic
1   | Yes        | No              | No       | Yes
2   | Yes        | Yes             | Yes      | Yes
6   | Yes        | No              | Yes      | No
9   | Yes        | No              | Yes      | No

Right node (Overweight = No):

Row | Overweight | Family Diabetic | Age > 45 | Diabetic
3   | No         | Yes             | No       | Yes
4   | No         | Yes             | Yes      | Yes
5   | No         | Yes             | Yes      | Yes
7   | No         | No              | No       | No
8   | No         | No              | No       | No

Among those who are overweight, 2 have diabetes and 2 don't. Among those who are not overweight, 3 have diabetes and 2 don't.

STEP 2

Next, we want to measure how similar (pure/homogeneous) the items in each node are.

<insert 2 node, 1 homogenous, 1 not>

All the items in this node are YES, so this node has high purity. We want nodes with high purity.

This node is a mixed bag of 50% YES and 50% NO, so it has low purity. We don't want nodes with low purity.

We prefer high-purity nodes because they make it easy to classify a person. For example, based on this decision tree, if a person is overweight then we can conclude that he is diabetic, because the left node only contains diabetic = Yes persons.

We can quantify the purity of a node using Gini Impurity.

\(Gini = 1 - \sum_{i=1}^N {P_i}^2\)

N is the total number of classes, in our case N = 2 (Diabetic: Yes and No).
\(P_{i}\) is the probability of the \(i^{th}\) class. For example \(P_{1}\) can be the probability of diabetic = Yes and \(P_{2}\) be the probability of diabetic = No.
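A quick sketch of the formula in code (the class counts below are just illustrative, matching the pure node and the 50/50 node described above):

```python
def gini(counts):
    """Gini impurity from a list of class counts: 1 - sum(p_i^2)."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

print(gini([4, 0]))  # a pure node (all YES)  -> 0.0
print(gini([2, 2]))  # a 50/50 mixed node     -> 0.5
```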

The Gini value ranges between 0 and 0.5 for two classes. A Gini value closer to 0.5 means the group has lower purity, while a value closer to 0 means higher purity. Now we compute the Gini Impurity for all three nodes.

Gini Impurity for the left node (2 Yes, 2 No):
\(Gini = 1 - \left(\frac{2}{4}\right)^2 - \left(\frac{2}{4}\right)^2 = 0.5\)

Gini Impurity for the right node (3 Yes, 2 No):
\(Gini = 1 - \left(\frac{3}{5}\right)^2 - \left(\frac{2}{5}\right)^2 = 0.48\)

Gini Impurity for the parent node (5 Yes, 4 No):
\(Gini = 1 - \left(\frac{5}{9}\right)^2 - \left(\frac{4}{9}\right)^2 = 0.4938\)

STEP 3

We want to know how much the purity changes after the split. We prefer splits that take a group from less pure to more pure. We can quantify this change in purity using Information Gain (IG). A higher IG means the split produces purer child nodes.

\(IG = Gini(Parent) - W_1 \cdot Gini(Left) - W_2 \cdot Gini(Right)\)

where \(W_1\) and \(W_2\) are the fractions of rows that fall into the left and right nodes.

For the Overweight split:

\(IG = 0.4938 - \frac{4}{9}(0.5) - \frac{5}{9}(0.48) = 0.0049\)
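A quick arithmetic check of this value (a sketch; 4 of the 9 rows went to the left node and 5 to the right):

```python
gini_parent, gini_left, gini_right = 0.4938, 0.5, 0.48
w_left, w_right = 4 / 9, 5 / 9      # fractions of rows in each child node
ig = gini_parent - w_left * gini_left - w_right * gini_right
print(round(ig, 4))                 # -> 0.0049
```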

STEP 4

After this, we repeat STEP 1, which randomly chooses from the remaining features (Family Diabetic, Age > 45). Steps 1 to 3 are repeated until no features are left.

Splitting by Family Diabetic:

Gini Impurity for the left node (4 Yes, 0 No):
\(Gini = 1 - \left(\frac{4}{4}\right)^2 - \left(\frac{0}{4}\right)^2 = 0\)

Gini Impurity for the right node (1 Yes, 4 No):
\(Gini = 1 - \left(\frac{1}{5}\right)^2 - \left(\frac{4}{5}\right)^2 = 0.32\)

Gini Impurity for the parent node (5 Yes, 4 No):
\(Gini = 1 - \left(\frac{5}{9}\right)^2 - \left(\frac{4}{9}\right)^2 = 0.4938\)

\(IG = 0.4938 - \frac{4}{9}(0) - \frac{5}{9}(0.32) = 0.3160\)

Splitting by Age > 45:

Gini Impurity for the left node (3 Yes, 2 No):
\(Gini = 1 - \left(\frac{3}{5}\right)^2 - \left(\frac{2}{5}\right)^2 = 0.48\)

Gini Impurity for the right node (2 Yes, 2 No):
\(Gini = 1 - \left(\frac{2}{4}\right)^2 - \left(\frac{2}{4}\right)^2 = 0.5\)

Gini Impurity for the parent node (5 Yes, 4 No):
\(Gini = 1 - \left(\frac{5}{9}\right)^2 - \left(\frac{4}{9}\right)^2 = 0.4938\)

\(IG = 0.4938 - \frac{5}{9}(0.48) - \frac{4}{9}(0.5) = 0.0049\)

STEP 5

Split by         | IG
Overweight       | 0.0049
Family Diabetic  | 0.3160
Age > 45         | 0.0049

We decide to split by Family Diabetic because it has the highest Information Gain. This means the purity increases the most after this split.
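As a cross-check (a sketch, not the article's code), the three Information Gain values can be recomputed directly from the nine rows of the dataset:

```python
from collections import Counter

# The nine rows from the dataset: (Overweight, Family Diabetic, Age > 45, Diabetic).
rows = [
    ("Yes", "No",  "No",  "Yes"),
    ("Yes", "Yes", "Yes", "Yes"),
    ("No",  "Yes", "No",  "Yes"),
    ("No",  "Yes", "Yes", "Yes"),
    ("No",  "Yes", "Yes", "Yes"),
    ("Yes", "No",  "Yes", "No"),
    ("No",  "No",  "No",  "No"),
    ("No",  "No",  "No",  "No"),
    ("Yes", "No",  "Yes", "No"),
]
features = {"Overweight": 0, "Family Diabetic": 1, "Age > 45": 2}

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

parent = [r[3] for r in rows]
for name, col in features.items():
    left = [r[3] for r in rows if r[col] == "Yes"]
    right = [r[3] for r in rows if r[col] == "No"]
    ig = (gini(parent)
          - len(left) / len(rows) * gini(left)
          - len(right) / len(rows) * gini(right))
    print(f"{name}: {ig:.4f}")
# Overweight: 0.0049
# Family Diabetic: 0.3160
# Age > 45: 0.0049
```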

Steps 1-5 are then repeated on each child node until a stopping criterion is met (for example, the nodes are pure or no features are left). Our final decision tree looks like this:

Why Use a Decision Tree?

1) Highly Interpretable
We can explain to stakeholders why our model made its predictions. Even better, we can state the exact decision rules (for example: when the weather is not rainy and it is a public holiday, people will buy ice cream). Stakeholders will have more confidence in our model than in a black-box model that spits out predictions without giving us any idea of how they are made.
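For example (a minimal sketch, not code from the article), scikit-learn's DecisionTreeClassifier can print the learned rules as plain text. The Yes/No values from the dataset above are encoded as 1/0 here, because scikit-learn expects numeric input:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# The nine training rows, with Yes = 1 and No = 0.
# Feature columns: Overweight, Family Diabetic, Age > 45.
X = [
    [1, 0, 0], [1, 1, 1], [0, 1, 0],
    [0, 1, 1], [0, 1, 1], [1, 0, 1],
    [0, 0, 0], [0, 0, 0], [1, 0, 1],
]
y = [1, 1, 1, 1, 1, 0, 0, 0, 0]  # Diabetic: Yes = 1, No = 0

clf = DecisionTreeClassifier(criterion="gini", random_state=0)
clf.fit(X, y)

# Plain-text rules that can be shown to stakeholders.
print(export_text(clf, feature_names=["Overweight", "Family Diabetic", "Age > 45"]))
```

The first rule printed should be a split on Family Diabetic, matching the Information Gain calculation above.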

2) No need for data preprocessing
If a feature is numerical, we don't have to perform normalization or scaling; if it is categorical, we don't have to perform one-hot encoding either.

Disadvantages
The decision boundary of a decision tree can be much more complex than that of other algorithms (like ?). Hence it is more likely to overfit and, as a result, has higher variance.