Let’s dive into how XGBoost tackles regression and classification problems, supported by examples and insights from the original research paper.
XGBoost for Linear Regression
Here, XGBoost works like gradient boosting, but with much better speed and performance. For our example, the base estimator is the mean, followed by training a decision tree on the residuals. In XGBoost, the process of constructing the decision tree is different from gradient boosting.
The example consists of a dataset of students with expected salary based on their grades out of 10.
Now, to calculate the Similarity Score (SS), take the grades and residual columns.

Similarity Score = (Sum of Residuals)² / (Number of Residuals + λ)

Here, λ is the regularization parameter; for ease, let’s keep it at 0. Before finding the SS of the split leaf nodes (the residual values), calculate the SS of the root:
SS(root) = (−2.8 + 3.7 − 1.3 + 0.7)² / (4 + 0) = 0.3² / 4 ≈ 0.02
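A minimal Python sketch of this calculation, assuming the four residuals shown above and λ = 0:

```python
# Similarity score of the root node, using the residuals from the example.
residuals = [-2.8, 3.7, -1.3, 0.7]

def similarity_score(res, lam=0):
    """SS = (sum of residuals)^2 / (number of residuals + lambda)."""
    return sum(res) ** 2 / (len(res) + lam)

print(similarity_score(residuals))  # ~0.0225, rounded to 0.02 in the text
```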
The goal is to split the leaf node in a way that increases the similarity score after the split. The split is based on the grade column; from the table above, three potential splitting criteria are identified. Among these, the one with the highest gain will be selected for the root node.
Take the first splitting criterion: 5.85
Calculate SS for both branches to find gain.
SS(left) = 0.7² / (1 + 0) = 0.49 | SS(right) = (−2.8 − 1.3 + 3.7)² / (3 + 0) ≈ 0.05
Gain = SS(left) + SS(right) − SS(root)
Gain for the 1st criterion = 0.49 + 0.05 − 0.02 = 0.52
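Continuing the sketch above, the gain for this split can be checked as follows; the branch assignments are taken from the SS(left) and SS(right) values just shown:

```python
# Gain for the split at grade 5.85: the left branch holds the single residual
# 0.7 and the right branch holds the remaining three residuals.
left, right, root = [0.7], [-2.8, -1.3, 3.7], [-2.8, 3.7, -1.3, 0.7]

gain = similarity_score(left) + similarity_score(right) - similarity_score(root)
print(round(gain, 2))  # ~0.52
```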
Take the second splitting criterion: 7.1
Following the same steps as before:
Gain for the 2nd criterion = 5.06
Take the third splitting criterion: 8.25
Gain for the 3rd criterion = 17.52
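Putting the three candidates together, here is a small sketch of the root-node selection. The residuals are assumed to be ordered by grade as [0.7, −2.8, −1.3, 3.7], which is an assumption that reproduces the three gains stated above:

```python
# Evaluate every candidate split and keep the one with the highest gain.
residuals_by_grade = [0.7, -2.8, -1.3, 3.7]   # assumed ordering by grade
thresholds = [5.85, 7.1, 8.25]

root_ss = similarity_score(residuals_by_grade)
best = None
for i, t in enumerate(thresholds, start=1):
    left, right = residuals_by_grade[:i], residuals_by_grade[i:]
    gain = similarity_score(left) + similarity_score(right) - root_ss
    print(f"split at {t}: gain = {gain:.2f}")
    if best is None or gain > best[1]:
        best = (t, gain)

print("selected root split:", best[0])  # 8.25, gain ~17.52
```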
Out of all three, the third splitting criterion yields the maximum gain, so it is selected as the root node. This process continues for the remaining nodes, with each split selected based on the highest gain. By repeating this process, XGBoost constructs the decision tree. The default maximum tree depth in XGBoost is 6, which works well even for large datasets.
Output values are then calculated for all leaf nodes; these are the predictions each tree contributes to the model.
Output = Sum of Residuals / (number of Residuals + λ)
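As a quick sketch with λ = 0, the two branches of the 8.25 split (under the grouping assumed above) get the following output values:

```python
def output_value(res, lam=0):
    """Leaf output = sum of residuals / (number of residuals + lambda)."""
    return sum(res) / (len(res) + lam)

print(output_value([3.7]))              # 3.7  (right branch of the 8.25 split)
print(output_value([0.7, -2.8, -1.3]))  # ~-1.13 (left branch of the 8.25 split)
```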
Let’s assume the combined model performs well after two decision trees. The final prediction is then the base prediction (the mean) plus the learning-rate-scaled output of each tree, and it looks like this:
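A minimal sketch of that combined prediction, assuming XGBoost’s default learning rate (eta) of 0.3; the base prediction and per-tree leaf outputs below are placeholders, not values from the article’s tables:

```python
# Combined model: base prediction plus the learning-rate-scaled output of
# each tree. eta = 0.3 is XGBoost's default learning rate.
def predict(base_prediction, tree_outputs, eta=0.3):
    return base_prediction + eta * sum(tree_outputs)

# e.g. mean salary as the base estimator, then the leaf output each of the
# two trees assigns to a given student (placeholder values):
print(predict(base_prediction=70.0, tree_outputs=[3.7, 2.1]))
```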
XGBoost for Classification
Let’s briefly go through classification. In this case, the process remains the same; the only difference is that the base estimator is the logarithm of the odds, log(odds).
Concept of log(odds):
- Odds: the ratio of the probability of an event occurring to the probability of it not occurring. [Odds = P / (1 − P)]
- Log(Odds): transforms probabilities onto a continuous, unbounded scale, which helps with linearization and interpretability (see the short sketch after this list).
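A short sketch of the conversion between probability, odds, and log(odds); the sigmoid maps log(odds) back to a probability:

```python
import math

def log_odds(p):
    """log(odds) = ln(p / (1 - p)); maps (0, 1) onto the whole real line."""
    return math.log(p / (1 - p))

def sigmoid(z):
    """Inverse of log_odds: maps log(odds) back to a probability."""
    return 1 / (1 + math.exp(-z))

p = 0.7
z = log_odds(p)
print(z, sigmoid(z))  # ~0.847, and back to 0.7
```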
The Similarity Score (SS) formula is different for the classification case:

SS = (Sum of Residuals)² / (Σ [Previous Probability × (1 − Previous Probability)] + λ)

and the Output Value is calculated by:

Output = Sum of Residuals / (Σ [Previous Probability × (1 − Previous Probability)] + λ)
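A minimal sketch of these two formulas, assuming each residual is the observed label minus the previously predicted probability for that sample:

```python
def similarity_score_clf(residuals, prev_probs, lam=0):
    """Classification SS = (sum r)^2 / (sum p*(1-p) + lambda)."""
    denom = sum(p * (1 - p) for p in prev_probs) + lam
    return sum(residuals) ** 2 / denom

def output_value_clf(residuals, prev_probs, lam=0):
    """Leaf output = sum r / (sum p*(1-p) + lambda)."""
    denom = sum(p * (1 - p) for p in prev_probs) + lam
    return sum(residuals) / denom

# e.g. two samples in a leaf, both starting from the base probability 0.5:
print(similarity_score_clf([0.5, -0.5], [0.5, 0.5]))  # 0.0
print(output_value_clf([0.5, 0.5], [0.5, 0.5]))       # 2.0
```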
Apart from this, every other step is similar to the regression process. By following this approach, we can build a decision tree for a classification problem, and then combine several decision trees into a single model that performs better, ultimately reducing the residuals to nearly zero.
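In practice, the xgboost library handles all of these steps. A minimal usage sketch with placeholder data (the grade/salary and grade/label tables from the example would go in X and y; the arrays below are hypothetical):

```python
import numpy as np
from xgboost import XGBRegressor, XGBClassifier

# Placeholder feature matrix and targets; replace with the grade/salary table
# (regression) or grade/label table (classification) from the example.
X = np.array([[5.5], [6.2], [8.0], [8.5]])
y_reg = np.array([60.0, 55.0, 75.0, 80.0])   # hypothetical salaries
y_clf = np.array([0, 0, 1, 1])               # hypothetical class labels

# Defaults mirror the article: max_depth=6, reg_lambda is the lambda in SS.
reg = XGBRegressor(n_estimators=2, max_depth=6, learning_rate=0.3, reg_lambda=0)
reg.fit(X, y_reg)

clf = XGBClassifier(n_estimators=2, max_depth=6, learning_rate=0.3, reg_lambda=0)
clf.fit(X, y_clf)

print(reg.predict(X), clf.predict_proba(X)[:, 1])
```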