Gradient Boosting Explained
Building Intuition
Training Phase
We have the dataset. We want to create 10 trees.
| Age | Gender | Exercise | Weight (y) |
|-----|--------|----------|------------|
| 36  | Male   | Yes      | 83         |
| 41  | Female | Yes      | 56         |
| 39  | Male   | No       | 92         |
STEP 0: We first find the average of all y values
\(\frac{83+56+92}{3} = 77\)
STEP 1: We find the residuals (y value - average)
| Age | Gender | Exercise | Weight (y) | Residuals  |
|-----|--------|----------|------------|------------|
| 36  | Male   | Yes      | 83         | 83-77=6    |
| 41  | Female | Yes      | 56         | 56-77=-21  |
| 39  | Male   | No       | 92         | 92-77=15   |
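For readers who like to see the arithmetic as code, here is a minimal Python sketch of Steps 0 and 1; the variable names are ours and not part of the article:

```python
# Step 0: the base prediction is the average of all y values.
# Step 1: the residuals are y value - average.
weights = [83, 56, 92]                                 # the Weight (y) column from the table

base_prediction = sum(weights) / len(weights)          # (83 + 56 + 92) / 3 = 77.0
residuals = [y - base_prediction for y in weights]     # [6.0, -21.0, 15.0]

print(base_prediction, residuals)
```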
STEP 2: Build a decision tree using the residuals
<insert decision tree diagram>
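The article itself has no code, but a rough sketch of this step with scikit-learn could look like the block below; the 0/1 encoding of Gender and Exercise and the `max_depth=2` setting are our own illustrative choices:

```python
# Step 2 sketch: fit a shallow regression tree to the residuals from Step 1.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Features: [Age, Gender (Male=1, Female=0), Exercise (Yes=1, No=0)] -- our own encoding
X = np.array([[36, 1, 1],
              [41, 0, 1],
              [39, 1, 0]])
residuals = np.array([6.0, -21.0, 15.0])     # from Step 1

tree = DecisionTreeRegressor(max_depth=2)    # a weak learner: keep the tree shallow
tree.fit(X, residuals)
print(tree.predict(X))                       # each leaf outputs the average residual in it
```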
STEP 3: Find the average of the residuals in each leaf node
STEP 4: Update the model and calculate the new prediction.
<insert update model diagram>
<insert table>
We repeat Steps 1 to 4 nine more times until we have a total of 10 trees.
Our final model will look like this:
<insert final model equation>
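As a rough sketch of how Steps 1 to 4 repeat to produce those 10 trees, the loop below uses scikit-learn trees; the 0.1 learning rate and the encoding of the categorical columns are our own assumptions and may differ from the final equation above:

```python
# Training loop sketch: Steps 1-4 repeated for 10 trees.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[36, 1, 1],      # [Age, Gender (Male=1), Exercise (Yes=1)] -- our own encoding
              [41, 0, 1],
              [39, 1, 0]])
y = np.array([83.0, 56.0, 92.0])

n_trees, learning_rate = 10, 0.1              # the learning rate is a typical choice, not from the article
base_prediction = y.mean()                    # Step 0
prediction = np.full(len(y), base_prediction)
trees = []

for _ in range(n_trees):
    residuals = y - prediction                                    # Step 1
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)   # Steps 2-3
    prediction = prediction + learning_rate * tree.predict(X)     # Step 4
    trees.append(tree)

# Final model: base prediction + learning_rate * (sum of all tree outputs)
print(prediction)
```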
Testing Phase
<insert new row>
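The new row is not filled in yet in this draft, so purely as an illustration we predict for a made-up person (Age 30, Male, exercises), reusing `trees`, `base_prediction`, and `learning_rate` from the training sketch above:

```python
# Testing phase sketch: run the new row through the base prediction and every tree.
import numpy as np

def predict(x_new, base_prediction, trees, learning_rate=0.1):
    # Final model: base prediction + learning_rate * sum of tree outputs
    return base_prediction + learning_rate * sum(t.predict(x_new)[0] for t in trees)

x_new = np.array([[30, 1, 1]])   # [Age, Gender (Male=1), Exercise (Yes=1)] -- made-up row
print(predict(x_new, base_prediction, trees, learning_rate))
```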
Dive into the Maths
<Gradient boosting intuition by Killian>
We first define the inputs:
(i) Our dataset is represented by \(\{x^{(i)}, y^{(i)}\}\)
x: features (age, gender, exercise)
y: label (weight)
i: \(i^{th}\) row of the dataset, e.g. i = 3 means the \(3^{rd}\) row
\(x^{(i)}\): features of the \(i^{th}\) row
\(y^{(i)}\): label of the \(i^{th}\) row
(ii) A differentiable loss function, \(L(y^{(i)}, \hat{y})\)
(iii) Number of trees, m
In our case, we want to create 10 trees so m = 10
STEP 0: We first find the average of all y values
We do not have a model yet, so we need to create a base model that is initialized to a constant value.
Let us start by creating a loss function. A loss function measures the difference between the real value and our model's prediction, \((y-\hat{y})\). We square the difference so that it is always positive, and we add a \(\frac{1}{2}\) in front so that the expression simplifies nicely after differentiation (as we will see below).
So our loss function looks like this:
\(L(y^{(i)}, \hat{y})
= \sum_{i=1}^n \frac{1}{2}(y^{(i)}-\hat{y})^2
= \frac{1}{2}[(y^{(1)}-\hat{y})^2 + (y^{(2)}-\hat{y})^2 + ... + (y^{(n)}-\hat{y})^2]\)
<insert diagram 1>
A good value to initialize with is one that has minimal loss. We want to find the prediction value \(\hat{y}\) that gives us the minimal loss over all the training data. We can express this mathematically as
\(h_0(x) = \arg \min_{\hat{y}} \sum_{i=1}^n \frac{1}{2}(y^{(i)}-\hat{y})^2\)
We first find the gradient \(\frac{\partial L}{\partial \hat{y}}\) by differentiating \(L(y^{(i)}, \hat{y})\)
(using Chain Rule)
\(\frac{\partial L}{\partial \hat{y}} = \frac{1}{2}[(-2)(y^{(1)} - \hat{y})+ (-2)(y^{(2)}-\hat{y}) + ... + (-2)(y^{(n)}-\hat{y})]\)
(Factor -2 out and group \(y, \hat{y}\) together)
\(= \frac{1}{2}(-2)[(y^{(1)}+ y^{(2)} +...+ y^{(n)}) - n\hat{y}]\)
<insert diagram 2>
Loss is minimum when \(\frac{\partial L}{\partial \hat{y}} = 0\)
\(\frac{1}{2}(-2)[(y^{(1)}+ y^{(2)} +...+ y^{(n)}) - n\hat{y}] = 0\)
\(n\hat{y} = y^{(1)}+ y^{(2)} +...+ y^{(n)}\)
\(\hat{y} = \frac{ y^{(1)}+ y^{(2)} +...+ y^{(n)}}{n}\)
<insert diagram 3>
Loss is minimum when \(\hat{y}\) is the average of all y values. So we initialize the model with the average of all y values.
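A quick numeric check (ours, not from the article) agrees with this: scanning many candidate constant predictions, the lowest loss is reached at the average of the y values.

```python
# Numeric check: the constant prediction with the lowest squared-error loss is the mean.
import numpy as np

y = np.array([83.0, 56.0, 92.0])
loss = lambda y_hat: 0.5 * np.sum((y - y_hat) ** 2)

candidates = np.linspace(50, 100, 501)                      # constant predictions to try
best = candidates[np.argmin([loss(c) for c in candidates])]
print(best, y.mean())                                       # both are (approximately) 77.0
```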
STEP 1
For a single data point, the loss is
\(L(y^{(i)}, \hat{y}) = \frac{1}{2}(y^{(i)}-\hat{y})^2\)
Differentiating the loss function,
\(\frac{\partial L}{\partial \hat{y}} = \frac{1}{2}(-2)(y^{(i)}-\hat{y})\)
\(-\frac{\partial L}{\partial \hat{y}} = y^{(i)}-\hat{y}\)
The pseudo-residual is defined as the negative gradient:
\(r_{im} = -\left[\frac{\partial L}{\partial \hat{y}}\right] = y^{(i)} - \hat{y}\)
So we can get the residuals by subtracting the predicted values from the y values.
STEP 2
Once we have the pseudo-residuals, our new dataset becomes \(\{x^{(i)}, r_{im}\}\). We now fit this dataset with a weak learner such as a decision tree. See this article to understand how a decision tree is formed.
STEP 3
Our tree has a total of \(J\) leaf nodes, in our case \(J=\). We want each leaf node to make a prediction with the lowest loss. So for each leaf node \(j\), where \(j=1,2,...,J\), we want to compute
\(\hat{y}_{jm}= \arg \min_{\hat{y}} \sum_{x^{(i)} \in R_{jm}} L(y^{(i)},h_{m-1}(x^{(i)})+\hat{y}) \)
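For the squared-error loss used in this article, this minimization can be solved directly; the short derivation below (our own sketch of the algebra) shows why Step 3 simply takes the average of the residuals in each leaf node. Setting the derivative with respect to \(\hat{y}\) to zero,
\(0 = \frac{\partial}{\partial \hat{y}} \sum_{x^{(i)} \in R_{jm}} \frac{1}{2}\big(y^{(i)} - h_{m-1}(x^{(i)}) - \hat{y}\big)^2 = -\sum_{x^{(i)} \in R_{jm}} \big(y^{(i)} - h_{m-1}(x^{(i)}) - \hat{y}\big)\)
\(\hat{y}_{jm} = \frac{1}{|R_{jm}|} \sum_{x^{(i)} \in R_{jm}} \big(y^{(i)} - h_{m-1}(x^{(i)})\big)\)
which is just the average of the residuals that fall into leaf \(j\).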