Gradient Boosting Explained

Building Intuition

Training Phase

We have the dataset below, and we want to create 10 trees.

| Age | Gender | Exercise | Weight (y) |
| --- | ------ | -------- | ---------- |
| 36  | Male   | Yes      | 83         |
| 41  | Female | Yes      | 56         |
| 39  | Male   | No       | 92         |

STEP 0: We first find the average of all y values

\(\frac{83+56+92}{3} = 77\)
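As a quick sanity check, here is a minimal Python sketch of this step on the y column of the table above:

```python
import numpy as np

# Observed weights (the y column of the table above).
y = np.array([83.0, 56.0, 92.0])

# STEP 0: the initial prediction is the average of all y values.
base_prediction = y.mean()
print(base_prediction)  # 77.0
```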

STEP 1: We find the residuals (y value - average)

| Age | Gender | Exercise | Weight (y) | Residuals     |
| --- | ------ | -------- | ---------- | ------------- |
| 36  | Male   | Yes      | 83         | 83 - 77 = 6   |
| 41  | Female | Yes      | 56         | 56 - 77 = -21 |
| 39  | Male   | No       | 92         | 92 - 77 = 15  |
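Continuing the sketch, the residuals are just the observed weights minus the current prediction of 77:

```python
import numpy as np

y = np.array([83.0, 56.0, 92.0])   # observed weights
y_hat = y.mean()                   # 77.0, the STEP 0 prediction
residuals = y - y_hat
print(residuals)                   # [  6. -21.  15.]
```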

STEP 2: Build a decision tree using the residuals

<insert decision tree diagram>

STEP 3: Find the average for each tree node
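In code, STEPS 2 and 3 can be sketched with scikit-learn's DecisionTreeRegressor as the weak learner; the depth limit and the numeric encoding of Gender/Exercise are assumptions made for illustration, since the original tree diagram is not shown.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Features: Age, Gender (Male=1), Exercise (Yes=1) -- encoding is illustrative.
X = np.array([[36, 1, 1],
              [41, 0, 1],
              [39, 1, 0]], dtype=float)
residuals = np.array([6.0, -21.0, 15.0])

# STEP 2: fit a small decision tree to the residuals.
tree = DecisionTreeRegressor(max_depth=1, random_state=0)
tree.fit(X, residuals)

# STEP 3: each leaf of a regression tree predicts the average of the
# residuals that fall into it, i.e. "the average for each node".
print(tree.predict(X))
```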

STEP 4: Update the model and calculate the new prediction.

<insert update model diagram>

<insert table>

We repeat steps 1 to 4 nine more times until we have a total of 10 trees.
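Putting the four steps together, a minimal training loop might look like the sketch below; the learning rate of 0.1 is an assumed shrinkage value, not something specified in the walk-through above.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[36, 1, 1], [41, 0, 1], [39, 1, 0]], dtype=float)
y = np.array([83.0, 56.0, 92.0])

n_trees = 10          # we want 10 trees, as in the example
learning_rate = 0.1   # assumed shrinkage factor, not from the text

# STEP 0: initialize the model with the average of y.
prediction = np.full(len(y), y.mean())
trees = []

for _ in range(n_trees):
    residuals = y - prediction                       # STEP 1
    tree = DecisionTreeRegressor(max_depth=1, random_state=0)
    tree.fit(X, residuals)                           # STEPS 2-3
    trees.append(tree)
    prediction += learning_rate * tree.predict(X)    # STEP 4

print(prediction)  # moves from 77 toward [83, 56, 92]
```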

Our final model will look like this:

<insert final model equation>

Testing Phase

<insert new row>
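Once the 10 trees are trained, predicting for a new row replays the same sum: the base prediction plus every tree's scaled contribution. A minimal sketch, reusing the variables from the training loop above and a made-up new person (25-year-old male who exercises), since the actual new row is a placeholder:

```python
# Hypothetical new row: Age=25, Male=1, Exercise=1 (illustrative values only).
x_new = np.array([[25.0, 1.0, 1.0]])

# Final model: base prediction plus the scaled output of each tree.
y_new = y.mean() + learning_rate * sum(t.predict(x_new)[0] for t in trees)
print(y_new)
```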

Dive into the Maths

<Gradient boosting intuition by Killian>

We first define the inputs:

(i) Our dataset is represented by \(\{x^{(i)}, y^{(i)}\}_{i=1}^n\)
x: features (age, gender, exercise)
y: label (weight)
i: the \(i^{th}\) row of the dataset, e.g. i=3 means the \(3^{rd}\) row
\(x^{(i)}\): features of the \(i^{th}\) row
\(y^{(i)}\): label of the \(i^{th}\) row

(ii) A differentiable loss function, \(L(y^{(i)}, \hat{y})\)

(iii) Number of trees, m
In our case, we want to create 10 trees, so m = 10

STEP 0: We first find the average of all y values

We do not have a model yet, so we need to create a base model that is initialized to a constant value.

Let us start by creating a loss function. A loss function measures the difference between the real value and our model's prediction \((y-\hat{y})\). We square the difference so that it is always positive, and add a \(\frac{1}{2}\) in front so that the derivative comes out cleanly (more on this later).

So our loss function looks like this:
\(L(y^{(i)}, \hat{y}) = \frac{1}{2}(y^{(i)}-\hat{y})^2\), and summed over the whole training set,
\(\sum_{i=1}^n L(y^{(i)}, \hat{y}) = \sum_{i=1}^n \frac{1}{2}(y^{(i)}-\hat{y})^2 = \frac{1}{2}[(y^{(1)}-\hat{y})^2 + (y^{(2)}-\hat{y})^2 + ... + (y^{(n)}-\hat{y})^2]\)

<insert diagram 1>

A good value to initialize with is one that minimizes the loss. We want to find the prediction value \(\hat{y}\) that gives us the minimal loss over all the training data. We can express this mathematically as
\(h_0(x) = \arg \min_\hat{y} \sum_{i=1}^n \frac{1}{2}(y^{(i)}-\hat{y})^2\)

We first find the gradient \(\frac{\partial L}{\partial \hat{y}}\) by differentiating the total loss with respect to \(\hat{y}\)
(using Chain Rule)
\(\frac{\partial L}{\partial \hat{y}} = \frac{1}{2}[(-2)(y^{(1)} - \hat{y})+ (-2)(y^{(2)}-\hat{y}) + ... + (-2)(y^{(n)}-\hat{y})]\)
(Factor -2 out and group \(y, \hat{y}\) together)
\(= \frac{1}{2}(-2)[(y^{(1)}+ y^{(2)} +...+ y^{(n)}) - n\hat{y}]\)

<insert diagram 2>

Loss is minimum when \(\frac{\partial L}{\partial \hat{y}} = 0\)

\(\frac{1}{2}(-2)[(y^{(1)}+ y^{(2)} +...+ y^{(n)}) - n\hat{y}] = 0\)
\(n\hat{y} = y^{(1)}+ y^{(2)} +...+ y^{(n)}\)
\(\hat{y} = \frac{ y^{(1)}+ y^{(2)} +...+ y^{(n)}}{n}\)

<insert diagram 3>

Loss is minimum when \(\hat{y}\) is the average of all y values. So we initialize the model with the average of all y values.
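A quick numerical check of this result on the toy weights, a sketch comparing the loss at the mean against two other candidate predictions:

```python
import numpy as np

y = np.array([83.0, 56.0, 92.0])

def loss(y_hat):
    # Total squared-error loss with the 1/2 factor, as defined above.
    return 0.5 * np.sum((y - y_hat) ** 2)

print(loss(77.0))              # 351.0 -- loss at the mean
print(loss(70.0), loss(85.0))  # 424.5 and 447.0, both larger
```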

STEP 1

\(L(y^{(i)}, \hat{y}) = \frac{1}{2}(y^{(i)}-\hat{y})^2\)
Differentiating the loss function
\(\frac{\partial L}{\partial \hat{y}} = \frac{1}{2}(-2)(y^{(i)}-\hat{y}) = -(y^{(i)}-\hat{y})\)
\(-\frac{\partial L}{\partial \hat{y}} = y^{(i)}-\hat{y}\)

\(r_{im} = -\left[\frac{\partial L}{\partial \hat{y}}\right] = y^{(i)} - \hat{y}\)

So we can get the residuals by subtracting the predicted values from the y values.
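The pseudo-residual is literally the negative gradient of the loss; a small sketch confirming this numerically for the first row (y = 83, current prediction 77):

```python
def loss(y, y_hat):
    return 0.5 * (y - y_hat) ** 2

y_i, y_hat, eps = 83.0, 77.0, 1e-6

# Numerical derivative of the loss with respect to the prediction.
grad = (loss(y_i, y_hat + eps) - loss(y_i, y_hat - eps)) / (2 * eps)
print(-grad, y_i - y_hat)  # both approximately 6.0
```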

STEP 2

Once we have the pseudo-residuals, our new dataset becomes \(\{x^{(i)}, r_{im}\}\). We now fit this dataset to a weak learner such as a decision tree. See this article to understand how a decision tree is formed.

STEP 3

Our tree has a total of \(J\) leaf nodes, in our case \(J=\). We want each leaf node to make a prediction with the lowest loss. So for each leaf node \(j\), where \(j=1,2,...,J\), we want to compute
\(\hat{y}_{j,m}= \arg \min_{\hat{y}} \sum_{x^{(i)} \in R_{j,m}} L(y^{(i)},h_{m-1}(x^{(i)})+\hat{y})\)
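For the squared-error loss this arg min works out to the average residual in each leaf region \(R_{j,m}\). A sketch of that computation, grouping rows by the leaf they fall into via scikit-learn's apply (the tree depth and feature encoding are assumptions, as before):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[36, 1, 1], [41, 0, 1], [39, 1, 0]], dtype=float)
y = np.array([83.0, 56.0, 92.0])

prev_prediction = np.full(len(y), y.mean())   # h_{m-1}(x), here the base model
residuals = y - prev_prediction

tree = DecisionTreeRegressor(max_depth=1, random_state=0).fit(X, residuals)
leaf_ids = tree.apply(X)                      # which leaf R_{j,m} each row lands in

# For squared-error loss the optimal leaf value is the mean residual in that leaf.
leaf_values = {j: residuals[leaf_ids == j].mean() for j in np.unique(leaf_ids)}
print(leaf_values)
```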