ENSEMBLE LEARNING
Fitting to errors one booster stage at a time
Of course, in machine learning, we want our predictions spot on. We started with simple decision trees — they worked okay. Then came Random Forests and AdaBoost, which did better. But Gradient Boosting? That was a game-changer, making predictions way more accurate.
They said, “What makes Gradient Boosting work so well is actually simple: it builds models one after another, where each new model focuses on fixing the mistakes of all previous models combined. This way of fixing errors step by step is what makes it special.” I thought it was really going to be that simple, but every time I look up Gradient Boosting to understand how it works, I see the same thing: rows and rows of complex math formulas and ugly charts that somehow drive me insane. Just try it.
Let’s put a stop to this and break it down in a way that actually makes sense. We’ll visually navigate through the training steps of Gradient Boosting, focusing on a regression case — a simpler scenario than classification — so we can avoid the confusing math. Like a multi-stage rocket shedding unnecessary weight to reach orbit, we’ll blast away those prediction errors one residual at a time.
Definition
Gradient Boosting is an ensemble machine learning technique that builds a series of decision trees, each aimed at correcting the errors of the previous ones. Unlike AdaBoost, which uses shallow trees, Gradient Boosting uses deeper trees as its weak learners. Each new tree focuses on minimizing the residual errors — the differences between actual and predicted values — rather than learning directly from the original targets.
For regression tasks, Gradient Boosting adds trees one after another, with each new tree trained to reduce the current residual errors. The final prediction is the initial prediction plus the sum of the (scaled) outputs from all the trees.
The model’s strength comes from its additive learning process — while each tree focuses on correcting the remaining errors in the ensemble, the sequential combination creates a powerful predictor that progressively reduces the overall prediction error by focusing on the parts of the problem where the model still struggles.
Dataset Used
Throughout this article, we’ll use the classic golf dataset as a regression example. While Gradient Boosting handles both regression and classification effectively, we’ll concentrate on the simpler of the two, regression: predicting the number of players who will show up to play golf based on weather conditions.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Create dataset
dataset_dict = {
    'Outlook': ['sunny', 'sunny', 'overcast', 'rain', 'rain', 'rain', 'overcast',
                'sunny', 'sunny', 'rain', 'sunny', 'overcast', 'overcast', 'rain',
                'sunny', 'overcast', 'rain', 'sunny', 'sunny', 'rain', 'overcast',
                'rain', 'sunny', 'overcast', 'sunny', 'overcast', 'rain', 'overcast'],
    'Temp.': [85.0, 80.0, 83.0, 70.0, 68.0, 65.0, 64.0, 72.0, 69.0, 75.0, 75.0,
              72.0, 81.0, 71.0, 81.0, 74.0, 76.0, 78.0, 82.0, 67.0, 85.0, 73.0,
              88.0, 77.0, 79.0, 80.0, 66.0, 84.0],
    'Humid.': [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0,
               90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0,
               65.0, 70.0, 60.0, 95.0, 70.0, 78.0],
    'Wind': [False, True, False, False, False, True, True, False, False, False, True,
             True, False, True, True, False, False, True, False, True, True, False,
             True, False, False, True, False, False],
    'Num_Players': [52, 39, 43, 37, 28, 19, 43, 47, 56, 33, 49, 23, 42, 13, 33, 29,
                    25, 51, 41, 14, 34, 29, 49, 36, 57, 21, 23, 41]
}

# Prepare data: one-hot encode Outlook and turn the Wind booleans into 0/1
df = pd.DataFrame(dataset_dict)
df = pd.get_dummies(df, columns=['Outlook'], prefix='', prefix_sep='')
df['Wind'] = df['Wind'].astype(int)

# Split features and target (first half for training, second half for testing)
X, y = df.drop('Num_Players', axis=1), df['Num_Players']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, shuffle=False)
Main Mechanism
Here’s how Gradient Boosting works:
- Initialize Model: Start with a simple prediction, typically the mean of target values.
- Iterative Learning: For a set number of iterations, compute the residuals, train a decision tree to predict these residuals, and add the new tree’s predictions (scaled by the learning rate) to the running total.
- Build Trees on Residuals: Each new tree focuses on the remaining errors from all previous iterations.
- Final Prediction: Sum up all tree contributions (scaled by the learning rate) and the initial prediction.
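The four steps above can be sketched in a few lines of Python. This is a minimal sketch rather than scikit-learn's actual implementation: it assumes squared-error loss and uses `DecisionTreeRegressor` as the weak learner. The function names `fit_gradient_boosting` and `predict_gradient_boosting`, and the eight-row toy array (only temperature and humidity), are made up for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_gradient_boosting(X, y, n_trees=50, learning_rate=0.1, max_depth=3):
    """Fit a squared-error boosting ensemble; return (initial value, list of trees)."""
    initial = float(np.mean(y))             # Initialize Model: start from the mean
    current = np.full(len(y), initial)      # running prediction for every row
    trees = []
    for _ in range(n_trees):                # Iterative Learning
        residuals = y - current             # what the ensemble still gets wrong
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residuals)              # Build Trees on Residuals
        current = current + learning_rate * tree.predict(X)
        trees.append(tree)
    return initial, trees


def predict_gradient_boosting(X, initial, trees, learning_rate=0.1):
    """Final Prediction: initial value plus the scaled sum of all tree outputs."""
    total = np.full(X.shape[0], initial)
    for tree in trees:
        total = total + learning_rate * tree.predict(X)
    return total


# Toy data: (temperature, humidity) -> players, first eight rows of the golf table
X_demo = np.array([[85.0, 85.0], [80.0, 90.0], [83.0, 78.0], [70.0, 96.0],
                   [68.0, 80.0], [65.0, 70.0], [64.0, 65.0], [72.0, 95.0]])
y_demo = np.array([52.0, 39.0, 43.0, 37.0, 28.0, 19.0, 43.0, 47.0])

initial, trees = fit_gradient_boosting(X_demo, y_demo)
fitted = predict_gradient_boosting(X_demo, initial, trees)

mse_start = float(np.mean((y_demo - initial) ** 2))
mse_end = float(np.mean((y_demo - fitted) ** 2))
print(f"training MSE: {mse_start:.2f} -> {mse_end:.2f}")
```

Because each tree only moves the prediction by 10% of what it learned, the training error shrinks gradually instead of in one jump.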
Training Steps
We’ll follow the standard gradient boosting approach:
1.0. Set Model Parameters:
Before building any trees, we need to set the core parameters that control the learning process:
· the number of trees (typically 100, but we’ll choose 50) to build sequentially,
· the learning rate (typically 0.1), and
· the maximum depth of each tree (typically 3)
For the First Tree
2.0. Make an initial prediction for the label. This is typically the mean of the training targets (just like a dummy prediction).
2.1. Calculate the temporary residuals (or pseudo-residuals):
residual = actual value - predicted value
2.2. Build a decision tree to predict these residuals. The tree building steps are exactly the same as in the regression tree.
a. Calculate initial MSE (Mean Squared Error) for the root node
b. For each feature:
· Sort data by feature values
· For each possible split point:
·· Split samples into left and right groups
·· Calculate MSE for both groups
·· Calculate MSE reduction for this split
c. Pick the split that gives the largest MSE reduction
d. Continue splitting until reaching maximum depth or minimum samples per leaf.
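Steps a to c amount to a brute-force search over split points. Here is a rough sketch for a single numeric feature. It assumes candidate thresholds are the midpoints between consecutive sorted values and that the two groups' MSEs are combined as a size-weighted average. The helpers `mse` and `best_split` and the toy numbers are illustrative, not taken from scikit-learn.

```python
def mse(values):
    """Mean squared error of a group around its own mean (0.0 for an empty group)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def best_split(feature, residuals):
    """Return (threshold, mse_reduction) of the best split on one feature."""
    n = len(residuals)
    parent_mse = mse(residuals)                 # step a: MSE at the root
    pairs = sorted(zip(feature, residuals))     # step b: sort by feature value
    best_threshold, best_reduction = None, 0.0
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                            # no threshold fits between equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [r for _, r in pairs[:i]]
        right = [r for _, r in pairs[i:]]
        weighted = (len(left) * mse(left) + len(right) * mse(right)) / n
        reduction = parent_mse - weighted
        if reduction > best_reduction:          # step c: keep the largest reduction
            best_threshold, best_reduction = threshold, reduction
    return best_threshold, best_reduction


# Hypothetical toy values: residuals flip sign between temperatures 70 and 75
temps = [64.0, 68.0, 70.0, 75.0, 80.0, 85.0]
resid = [-10.0, -8.0, -9.0, 9.0, 8.0, 10.0]

threshold, reduction = best_split(temps, resid)
print(threshold, round(reduction, 2))   # 72.5 81.0
```

A full tree repeats this search over every feature, then recurses into the left and right groups until the depth or leaf-size limit from step d is hit.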
2.3. Calculate Leaf Values
For each leaf, find the mean of residuals.
2.4. Update Predictions
· For each data point in the training dataset, determine which leaf it falls into based on the new tree.
· Multiply the new tree’s predictions by the learning rate and add these scaled predictions to the current model’s predictions. This will be the updated prediction.
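To make the arithmetic of the first round concrete, here is a small worked example on the training targets. The targets are the first 14 values of `Num_Players`. The leaf membership (rows 0, 8 and 10 sharing a leaf) is an assumption made up for illustration; a real tree decides it from the splits.

```python
# Training targets: first 14 values of Num_Players (50/50 split, no shuffling)
y_train = [52, 39, 43, 37, 28, 19, 43, 47, 56, 33, 49, 23, 42, 13]
learning_rate = 0.1

# Step 2.0: the initial prediction is the mean of the training targets
initial_prediction = sum(y_train) / len(y_train)

# Step 2.1: residual = actual value - predicted value
residuals = [actual - initial_prediction for actual in y_train]

# Step 2.3: leaf value = mean residual of the rows that land in the leaf.
# ASSUMPTION: this grouping of rows is hypothetical, not produced by a fitted tree.
leaf_rows = [0, 8, 10]
leaf_value = sum(residuals[i] for i in leaf_rows) / len(leaf_rows)

# Step 2.4: updated prediction for every row in that leaf
updated_prediction = initial_prediction + learning_rate * leaf_value

print(f"initial prediction: {initial_prediction:.2f}")   # 37.43
print(f"leaf value:         {leaf_value:.2f}")           # 14.90
print(f"updated prediction: {updated_prediction:.2f}")   # 38.92
```

The leaf says "these rows are about 14.9 players above the current guess", but the learning rate of 0.1 lets the model take only a tenth of that step.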
For the Second Tree
2.1. Calculate new residuals based on current model
a. Compute the difference between the target and current predictions.
These residuals will be a bit different from the first iteration.
2.2. Build a new tree to predict these residuals. Same process as first tree, but targeting new residuals.
2.3. Calculate the mean residuals for each leaf
2.4. Update model predictions
· Multiply the new tree’s predictions by the learning rate.
· Add the new scaled tree predictions to the running total.
For the Third Tree onwards
Repeat Steps 2.1–2.4 for the remaining iterations. Note that each tree sees different residuals.
· Trees progressively focus on harder-to-predict patterns
· The learning rate helps prevent overfitting by limiting each tree’s contribution
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingRegressor

# Train the model with the parameters chosen above (50 trees, learning rate 0.1, depth 3)
clf = GradientBoostingRegressor(criterion='squared_error', n_estimators=50,
                                learning_rate=0.1, max_depth=3, random_state=42)
clf.fit(X_train, y_train)

# Plot trees 1, 3, 25, and 50
plt.figure(figsize=(11, 20), dpi=300)
for i, tree_idx in enumerate([0, 2, 24, 49]):
    plt.subplot(4, 1, i+1)
    plot_tree(clf.estimators_[tree_idx, 0],
              feature_names=list(X_train.columns),
              impurity=False,
              filled=True,
              rounded=True,
              precision=2,
              fontsize=12)
    plt.title(f'Tree {tree_idx + 1}')
plt.suptitle('Decision Trees from GradientBoosting', fontsize=16)
plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()
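If you would rather see the errors shrink than read the trees, scikit-learn's `staged_predict` yields the ensemble's prediction after each added tree. The sketch below is a simplified stand-in for the model above: it assumes only the `Temp.` and `Humid.` columns of the 14 training rows, typed in by hand so the snippet runs on its own.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Temp. and Humid. of the 14 training rows, with their player counts
X_demo = np.array([[85, 85], [80, 90], [83, 78], [70, 96], [68, 80], [65, 70], [64, 65],
                   [72, 95], [69, 70], [75, 80], [75, 70], [72, 90], [81, 75], [71, 80]],
                  dtype=float)
y_demo = np.array([52, 39, 43, 37, 28, 19, 43, 47, 56, 33, 49, 23, 42, 13], dtype=float)

model = GradientBoostingRegressor(n_estimators=50, learning_rate=0.1,
                                  max_depth=3, random_state=42)
model.fit(X_demo, y_demo)

# One entry per boosting stage: training MSE after 1, 2, ..., 50 trees
stage_mse = [float(np.mean((y_demo - pred) ** 2))
             for pred in model.staged_predict(X_demo)]

for n_trees in (1, 2, 10, 50):
    print(f"after {n_trees:2d} trees: training MSE = {stage_mse[n_trees - 1]:.2f}")
```

Keep in mind this is training error only. It is expected to keep falling as trees are added, which says nothing yet about how the model does on unseen rows.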
Testing Step
For predicting:
a. Start with the initial prediction (the average number of players)
b. Run the input through each tree to get its predicted adjustment
c. Scale each tree’s prediction by the learning rate.
d. Add all these adjustments to the initial prediction
e. The sum directly gives us the predicted number of players
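Steps a to e can be replayed by hand on a fitted scikit-learn model. The sketch below relies on two assumptions about `GradientBoostingRegressor` with the default squared-error loss: that `init_` holds the mean predictor and that each tree in `estimators_` stores unscaled values. The final check against `predict` confirms this for the installed version. The two-feature toy data and the two "new" rows are illustrative.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Temp. and Humid. of the 14 training rows, with their player counts
X_demo = np.array([[85, 85], [80, 90], [83, 78], [70, 96], [68, 80], [65, 70], [64, 65],
                   [72, 95], [69, 70], [75, 80], [75, 70], [72, 90], [81, 75], [71, 80]],
                  dtype=float)
y_demo = np.array([52, 39, 43, 37, 28, 19, 43, 47, 56, 33, 49, 23, 42, 13], dtype=float)

model = GradientBoostingRegressor(n_estimators=50, learning_rate=0.1,
                                  max_depth=3, random_state=42).fit(X_demo, y_demo)

# Two hypothetical unseen days: (temperature, humidity)
X_new = np.array([[78.0, 75.0], [67.0, 90.0]])

# a. start with the initial prediction (the average number of players)
manual = model.init_.predict(X_new).astype(float)

# b-d. add every tree's adjustment, scaled by the learning rate
for stage in model.estimators_:        # each stage holds one regression tree
    tree = stage[0]
    manual += model.learning_rate * tree.predict(X_new)

# e. the running sum is the predicted number of players
print(manual)
print(np.allclose(manual, model.predict(X_new)))
```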
Evaluation Step
After building all the trees, we can evaluate the test set.
# Get predictions
y_pred = clf.predict(X_test)

# Create DataFrame with actual and predicted values
results_df = pd.DataFrame({
    'Actual': y_test,
    'Predicted': y_pred
})
print(results_df)  # Display results DataFrame

# Calculate and display RMSE (root_mean_squared_error needs scikit-learn 1.4 or later)
from sklearn.metrics import root_mean_squared_error
rmse = root_mean_squared_error(y_test, y_pred)
print(f"\nRoot Mean Squared Error: {rmse:.4f}")
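RMSE is just the square root of the average squared gap between actual and predicted values, so it is easy to compute by hand. As a reference point that is not part of the walkthrough above, the sketch below also scores the initial prediction alone (the training mean) on the test half. A boosted model is only useful if its RMSE comes in below that. The helper name `rmse_by_hand` is made up for illustration.

```python
import math


def rmse_by_hand(actual, predicted):
    """Square root of the mean squared difference between two equal-length lists."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# The two halves of Num_Players under the 50/50 unshuffled split
y_train = [52, 39, 43, 37, 28, 19, 43, 47, 56, 33, 49, 23, 42, 13]
y_test = [33, 29, 25, 51, 41, 14, 34, 29, 49, 36, 57, 21, 23, 41]

# Baseline: predict the training mean for every test row
baseline = sum(y_train) / len(y_train)
baseline_rmse = rmse_by_hand(y_test, [baseline] * len(y_test))
print(f"RMSE of the mean-only baseline: {baseline_rmse:.2f}")
```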
Key Parameters
Here are the key parameters for Gradient Boosting, particularly in scikit-learn:
max_depth: The depth of trees used to model residuals. Unlike AdaBoost which uses stumps, Gradient Boosting works better with deeper trees (typically 3-8 levels). Deeper trees capture more complex patterns but risk overfitting.
n_estimators: The number of trees to be used (typically 100-1000). More trees usually improve performance when paired with a small learning rate.
learning_rate: Also called “shrinkage”, this scales each tree’s contribution (typically 0.01-0.1). Smaller values require more trees but often give better results by making the learning process more fine-grained.
subsample: The fraction of samples used to train each tree (typically 0.5-0.8). This optional feature adds randomness that can improve robustness and reduce overfitting.
These parameters work together: a small learning rate needs more trees, while deeper trees might need a smaller learning rate to avoid overfitting.
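A quick way to feel the trade-off between learning rate and tree count is to hold the number of trees fixed and change only the learning rate. The sketch below assumes the two-feature toy version of the training rows and looks at training error only, so it shows how far each setting gets in 50 steps, not which one generalizes better.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Temp. and Humid. of the 14 training rows, with their player counts
X_demo = np.array([[85, 85], [80, 90], [83, 78], [70, 96], [68, 80], [65, 70], [64, 65],
                   [72, 95], [69, 70], [75, 80], [75, 70], [72, 90], [81, 75], [71, 80]],
                  dtype=float)
y_demo = np.array([52, 39, 43, 37, 28, 19, 43, 47, 56, 33, 49, 23, 42, 13], dtype=float)

train_mse = {}
for lr in (0.01, 0.1):
    model = GradientBoostingRegressor(n_estimators=50, learning_rate=lr,
                                      max_depth=3, random_state=42)
    model.fit(X_demo, y_demo)
    train_mse[lr] = float(np.mean((y_demo - model.predict(X_demo)) ** 2))

for lr, value in train_mse.items():
    print(f"learning_rate={lr}: training MSE after 50 trees = {value:.2f}")
```

With the smaller learning rate, 50 trees are not enough to work through the residuals, which is why small learning rates are usually paired with a larger `n_estimators`.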
Key differences from AdaBoost
Both AdaBoost and Gradient Boosting are boosting algorithms, but the way they learn from their mistakes is different. Here are the key differences:
- max_depth is typically higher (3-8) in Gradient Boosting, while AdaBoost prefers stumps.
- No sample_weight updates because Gradient Boosting uses residuals instead of sample weighting.
- The learning_rate is typically much smaller (0.01-0.1) compared to AdaBoost’s larger values (0.1-1.0).
- Initial prediction starts from the mean while AdaBoost starts from zero.
- Trees are combined through simple addition rather than weighted voting, making each tree’s contribution more straightforward.
- Optional subsample parameter adds randomness, a feature not present in standard AdaBoost.
Pros:
- Step-by-Step Error Fixing: In Gradient Boosting, each new tree focuses on correcting the mistakes made by the previous ones. This makes the model better at improving its predictions in areas where it was previously wrong.
- Flexible Error Measures: Unlike AdaBoost, Gradient Boosting can optimize different types of error measurements (like mean absolute error, mean squared error, or others). This makes it adaptable to various kinds of problems.
- High Accuracy: By using more detailed trees and carefully controlling the learning rate, Gradient Boosting often provides more accurate results than other algorithms, especially for well-structured data.
Cons:
- Risk of Overfitting: The use of deeper trees and the sequential building process can cause the model to fit the training data too closely, which may reduce its performance on new data. This requires careful tuning of tree depth, learning rate, and the number of trees.
- Slow Training Process: Like AdaBoost, trees must be built one after another, making it slower to train compared to algorithms that can build trees in parallel, like Random Forest. Each tree relies on the errors of the previous ones.
- High Memory Use: The need for deeper and more numerous trees means Gradient Boosting can consume more memory than simpler boosting methods such as AdaBoost.
- Sensitive to Settings: The effectiveness of Gradient Boosting heavily depends on finding the right combination of learning rate, tree depth, and number of trees, which can be more complex and time-consuming than tuning simpler algorithms.
Gradient Boosting is a major improvement in boosting algorithms. This success has led to popular versions like XGBoost and LightGBM, which are widely used in machine learning competitions and real-world applications.
While Gradient Boosting requires more careful tuning than simpler algorithms — especially when adjusting the depth of decision trees, the learning rate, and the number of trees — it is very flexible and powerful. This makes it a top choice for problems with structured data.
Gradient Boosting can handle complex relationships that simpler methods like AdaBoost might miss. Its continued popularity and ongoing improvements show that the approach of using gradients and building models step-by-step remains highly important in modern machine learning.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import root_mean_squared_error
from sklearn.ensemble import GradientBoostingRegressor

# Create dataset
dataset_dict = {
    'Outlook': ['sunny', 'sunny', 'overcast', 'rain', 'rain', 'rain', 'overcast',
                'sunny', 'sunny', 'rain', 'sunny', 'overcast', 'overcast', 'rain',
                'sunny', 'overcast', 'rain', 'sunny', 'sunny', 'rain', 'overcast',
                'rain', 'sunny', 'overcast', 'sunny', 'overcast', 'rain', 'overcast'],
    'Temp.': [85.0, 80.0, 83.0, 70.0, 68.0, 65.0, 64.0, 72.0, 69.0, 75.0, 75.0,
              72.0, 81.0, 71.0, 81.0, 74.0, 76.0, 78.0, 82.0, 67.0, 85.0, 73.0,
              88.0, 77.0, 79.0, 80.0, 66.0, 84.0],
    'Humid.': [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0,
               90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0,
               65.0, 70.0, 60.0, 95.0, 70.0, 78.0],
    'Wind': [False, True, False, False, False, True, True, False, False, False, True,
             True, False, True, True, False, False, True, False, True, True, False,
             True, False, False, True, False, False],
    'Num_Players': [52, 39, 43, 37, 28, 19, 43, 47, 56, 33, 49, 23, 42, 13, 33, 29,
                    25, 51, 41, 14, 34, 29, 49, 36, 57, 21, 23, 41]
}

# Prepare data
df = pd.DataFrame(dataset_dict)
df = pd.get_dummies(df, columns=['Outlook'], prefix='', prefix_sep='')
df['Wind'] = df['Wind'].astype(int)

# Split features and target
X, y = df.drop('Num_Players', axis=1), df['Num_Players']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, shuffle=False)

# Train Gradient Boosting
gb = GradientBoostingRegressor(
    n_estimators=50,     # Number of boosting stages (trees)
    learning_rate=0.1,   # Shrinks the contribution of each tree
    max_depth=3,         # Depth of each tree
    subsample=0.8,       # Fraction of samples used for each tree
    random_state=42
)
gb.fit(X_train, y_train)

# Predict and evaluate
y_pred = gb.predict(X_test)
rmse = root_mean_squared_error(y_test, y_pred)
print(f"Root Mean Squared Error: {rmse:.2f}")