Gradient Boosting Explained
Building Intuition
Training Phase
We have the dataset. We want to create 10 trees.
| Age | Gender | Exercise | Weight (y) |
|-----|--------|----------|------------|
| 36  | Male   | Yes      | 83         |
| 41  | Female | Yes      | 56         |
| 39  | Male   | No       | 92         |
STEP 0: We first find the average of all y values
\(\frac{83+56+92}{3} = 77\)
STEP 1: We find the residuals (y value - average)
| Age | Gender | Exercise | Weight (y) | Residuals  |
|-----|--------|----------|------------|------------|
| 36  | Male   | Yes      | 83         | 83-77=6    |
| 41  | Female | Yes      | 56         | 56-77=-21  |
| 39  | Male   | No       | 92         | 92-77=15   |
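For readers who like to see the arithmetic as code, here is a minimal Python sketch of Steps 0 and 1; the variable names are ours and not part of the article:

```python
# Step 0: the base prediction is the average of all y values.
# Step 1: the residuals are y value - average.
weights = [83, 56, 92]                                 # the Weight (y) column from the table

base_prediction = sum(weights) / len(weights)          # (83 + 56 + 92) / 3 = 77.0
residuals = [y - base_prediction for y in weights]     # [6.0, -21.0, 15.0]

print(base_prediction, residuals)
```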
STEP 2: Build a decision tree using the residuals
<insert decision tree diagram>
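The article itself has no code, but a rough sketch of this step with scikit-learn could look like the block below; the 0/1 encoding of Gender and Exercise and the `max_depth=2` setting are our own illustrative choices:

```python
# Step 2 sketch: fit a shallow regression tree to the residuals from Step 1.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Features: [Age, Gender (Male=1, Female=0), Exercise (Yes=1, No=0)] -- our own encoding
X = np.array([[36, 1, 1],
              [41, 0, 1],
              [39, 1, 0]])
residuals = np.array([6.0, -21.0, 15.0])     # from Step 1

tree = DecisionTreeRegressor(max_depth=2)    # a weak learner: keep the tree shallow
tree.fit(X, residuals)
print(tree.predict(X))                       # each leaf outputs the average residual in it
```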
STEP 3: Find the average of the residuals in each leaf node
STEP 4: Update the model and calculate the new prediction.
<insert update model diagram>
<insert table>
We repeat Steps 1 to 4 nine more times until we have a total of 10 trees.
Our final model will look like this:
<insert final model equation>
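As a rough sketch of how Steps 1 to 4 repeat to produce those 10 trees, the loop below uses scikit-learn trees; the 0.1 learning rate and the encoding of the categorical columns are our own assumptions and may differ from the final equation above:

```python
# Training loop sketch: Steps 1-4 repeated for 10 trees.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[36, 1, 1],      # [Age, Gender (Male=1), Exercise (Yes=1)] -- our own encoding
              [41, 0, 1],
              [39, 1, 0]])
y = np.array([83.0, 56.0, 92.0])

n_trees, learning_rate = 10, 0.1              # the learning rate is a typical choice, not from the article
base_prediction = y.mean()                    # Step 0
prediction = np.full(len(y), base_prediction)
trees = []

for _ in range(n_trees):
    residuals = y - prediction                                    # Step 1
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)   # Steps 2-3
    prediction = prediction + learning_rate * tree.predict(X)     # Step 4
    trees.append(tree)

# Final model: base prediction + learning_rate * (sum of all tree outputs)
print(prediction)
```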
Testing Phase
<insert new row>
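The new row is not filled in yet in this draft, so purely as an illustration we predict for a made-up person (Age 30, Male, exercises), reusing `trees`, `base_prediction`, and `learning_rate` from the training sketch above:

```python
# Testing phase sketch: run the new row through the base prediction and every tree.
import numpy as np

def predict(x_new, base_prediction, trees, learning_rate=0.1):
    # Final model: base prediction + learning_rate * sum of tree outputs
    return base_prediction + learning_rate * sum(t.predict(x_new)[0] for t in trees)

x_new = np.array([[30, 1, 1]])   # [Age, Gender (Male=1), Exercise (Yes=1)] -- made-up row
print(predict(x_new, base_prediction, trees, learning_rate))
```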
Dive into the Maths
<Gradient boosting intuition by Killian>
We first define the inputs:
(i) Our dataset is represented by \(\{x^{(i)}, y^{(i)}\}\)
x: features (age, gender, exercise)
y: label (weight)
i: \(i^{th}\) row of the dataset, e.g. i = 3 means the \(3^{rd}\) row
\(x^{(i)}\): features of the \(i^{th}\) row
\(y^{(i)}\): label of the \(i^{th}\) row
(ii) A differentiable loss function, \(L(y^{(i)}, \hat{y})\)
(iii) Number of trees, m
In our case, we want to create 10 trees so m = 10
STEP 0: We first find the average of all y values
We do not have a model yet, so we need to create a base model that is initialized to a constant value.
Let us start by creating a loss function. A loss function measures the difference between the real value and our model's prediction, \((y-\hat{y})\). We square the difference so that it is always positive, and we add a \(\frac{1}{2}\) in front so that the expression simplifies nicely after differentiation (as we will see below).
So our loss function looks like this:
\(L(y^{(i)}, \hat{y})
= \sum_{i=1}^n \frac{1}{2}(y^{(i)}-\hat{y})^2
= \frac{1}{2}[(y^{(1)}-\hat{y})^2 + (y^{(2)}-\hat{y})^2 + ... + (y^{(n)}-\hat{y})^2]\)
<insert diagram 1>
A good value to initialize with is one that has minimal loss. We want to find the prediction value \(\hat{y}\) that gives us the minimal loss over all the training data. We can express this mathematically as
\(h_0(x) = \arg \min_{\hat{y}} \sum_{i=1}^n \frac{1}{2}(y^{(i)}-\hat{y})^2\)
We first find the gradient \(\frac{\partial L}{\partial \hat{y}}\) by differentiating \(L(y^{(i)}, \hat{y})\)
(using Chain Rule)
\(\frac{\partial L}{\partial \hat{y}} = \frac{1}{2}[(-2)(y^{(1)} - \hat{y})+ (-2)(y^{(2)}-\hat{y}) + ... + (-2)(y^{(n)}-\hat{y})]\)
(Factor -2 out and group \(y, \hat{y}\) together)
\(= \frac{1}{2}(-2)[(y^{(1)}+ y^{(2)} +...+ y^{(n)}) - n\hat{y}]\)
<insert diagram 2>
Loss is minimum when \(\frac{\partial L}{\partial \hat{y}} = 0\)
\(\frac{1}{2}(-2)[(y^{(1)}+ y^{(2)} +...+ y^{(n)}) - n\hat{y}] = 0\)
\(n\hat{y} = y^{(1)}+ y^{(2)} +...+ y^{(n)}\)
\(\hat{y} = \frac{ y^{(1)}+ y^{(2)} +...+ y^{(n)}}{n}\)
<insert diagram 3>
Loss is minimum when \(\hat{y}\) is the average of all y values. So we initialize the model with the average of all y values.
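A quick numeric check (ours, not from the article) agrees with this: scanning many candidate constant predictions, the lowest loss is reached at the average of the y values.

```python
# Numeric check: the constant prediction with the lowest squared-error loss is the mean.
import numpy as np

y = np.array([83.0, 56.0, 92.0])
loss = lambda y_hat: 0.5 * np.sum((y - y_hat) ** 2)

candidates = np.linspace(50, 100, 501)                      # constant predictions to try
best = candidates[np.argmin([loss(c) for c in candidates])]
print(best, y.mean())                                       # both are (approximately) 77.0
```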
STEP 1
For a single data point, the loss is
\(L(y^{(i)}, \hat{y}) = \frac{1}{2}(y^{(i)}-\hat{y})^2\)
Differentiating the loss function,
\(\frac{\partial L}{\partial \hat{y}} = \frac{1}{2}(-2)(y^{(i)}-\hat{y})\)
\(-\frac{\partial L}{\partial \hat{y}} = y^{(i)}-\hat{y}\)
The pseudo-residual is defined as the negative gradient:
\(r_{im} = -\left[\frac{\partial L}{\partial \hat{y}}\right] = y^{(i)} - \hat{y}\)
So we can get the residuals by subtracting the predicted values from the y values.
STEP 2
Once we have the pseudo-residuals, our new dataset becomes \(\{x^{(i)}, r_{im}\}\). We now fit this dataset with a weak learner such as a decision tree. See this article to understand how a decision tree is formed.
STEP 3
Our tree has a total of \(J\) leaf nodes, in our case \(J=\). We want each leaf node to make a prediction with the lowest loss. So for each leaf node \(j\), where \(j=1,2,...,J\), we want to compute
\(\hat{y}_{jm}= \arg \min_{\hat{y}} \sum_{x^{(i)} \in R_{jm}} L(y^{(i)},h_{m-1}(x^{(i)})+\hat{y}) \)
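For the squared-error loss used in this article, this minimization can be solved directly; the short derivation below (our own sketch of the algebra) shows why Step 3 simply takes the average of the residuals in each leaf node. Setting the derivative with respect to \(\hat{y}\) to zero,
\(0 = \frac{\partial}{\partial \hat{y}} \sum_{x^{(i)} \in R_{jm}} \frac{1}{2}\big(y^{(i)} - h_{m-1}(x^{(i)}) - \hat{y}\big)^2 = -\sum_{x^{(i)} \in R_{jm}} \big(y^{(i)} - h_{m-1}(x^{(i)}) - \hat{y}\big)\)
\(\hat{y}_{jm} = \frac{1}{|R_{jm}|} \sum_{x^{(i)} \in R_{jm}} \big(y^{(i)} - h_{m-1}(x^{(i)})\big)\)
which is just the average of the residuals that fall into leaf \(j\).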