Gradient Boosting | Towards Data Science


ENSEMBLE LEARNING

Fitting to errors one booster stage at a time


Of course, in machine learning, we want our predictions spot on. We started with simple decision trees — they worked okay. Then came Random Forests and AdaBoost, which did better. But Gradient Boosting? That was a game-changer, making predictions way more accurate.

The usual explanation goes: “What makes Gradient Boosting work so well is actually simple: it builds models one after another, where each new model focuses on fixing the mistakes of all previous models combined. This way of fixing errors step by step is what makes it special.” I expected it to be that simple, but every time I look up Gradient Boosting to understand how it works, I see the same thing: rows and rows of complex math formulas and cluttered charts that are hard to follow. Just try it.

Let’s put a stop to this and break it down in a way that actually makes sense. We’ll visually navigate through the training steps of Gradient Boosting, focusing on a regression case — a simpler scenario than classification — so we can avoid the confusing math. Like a multi-stage rocket shedding unnecessary weight to reach orbit, we’ll blast away those prediction errors one residual at a time.

All visuals: Author-created using Canva Pro. Optimized for mobile; may appear oversized on desktop.

Definition

Gradient Boosting is an ensemble machine learning technique that builds a series of decision trees, each aimed at correcting the errors of the previous ones. Unlike AdaBoost, which uses shallow trees, Gradient Boosting uses deeper trees as its weak learners. Each new tree focuses on minimizing the residual errors — the differences between actual and predicted values — rather than learning directly from the original targets.

For regression tasks, Gradient Boosting adds trees one after another, with each new tree trained to reduce the current residual errors. The final prediction is made by adding the scaled outputs of all the trees to the initial prediction.

The model’s strength comes from its additive learning process — while each tree focuses on correcting the remaining errors in the ensemble, the sequential combination creates a powerful predictor that progressively reduces the overall prediction error by focusing on the parts of the problem where the model still struggles.

Gradient Boosting is part of the boosting family of algorithms because it builds trees sequentially, with each new tree trying to correct the errors of its predecessors. However, unlike other boosting methods, Gradient Boosting approaches the problem from an optimization perspective.

Dataset Used

Throughout this article, we’ll use the classic golf dataset as a regression example. Gradient Boosting handles both regression and classification, but we’ll concentrate on the simpler task, regression: predicting the number of players who will show up to play golf based on weather conditions.

Columns: ‘Outlook’ (one-hot-encoded into 3 columns), ‘Temperature’ (in Fahrenheit), ‘Humidity’ (in %), ‘Windy’ (Yes/No) and ‘Number of Players’ (the target)
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Create dataset
dataset_dict = {
    'Outlook': ['sunny', 'sunny', 'overcast', 'rain', 'rain', 'rain', 'overcast',
                'sunny', 'sunny', 'rain', 'sunny', 'overcast', 'overcast', 'rain',
                'sunny', 'overcast', 'rain', 'sunny', 'sunny', 'rain', 'overcast',
                'rain', 'sunny', 'overcast', 'sunny', 'overcast', 'rain', 'overcast'],
    'Temp.': [85.0, 80.0, 83.0, 70.0, 68.0, 65.0, 64.0, 72.0, 69.0, 75.0, 75.0,
              72.0, 81.0, 71.0, 81.0, 74.0, 76.0, 78.0, 82.0, 67.0, 85.0, 73.0,
              88.0, 77.0, 79.0, 80.0, 66.0, 84.0],
    'Humid.': [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0,
               90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0,
               65.0, 70.0, 60.0, 95.0, 70.0, 78.0],
    'Wind': [False, True, False, False, False, True, True, False, False, False, True,
             True, False, True, True, False, False, True, False, True, True, False,
             True, False, False, True, False, False],
    'Num_Players': [52, 39, 43, 37, 28, 19, 43, 47, 56, 33, 49, 23, 42, 13, 33, 29,
                    25, 51, 41, 14, 34, 29, 49, 36, 57, 21, 23, 41]
}

# Prepare data
df = pd.DataFrame(dataset_dict)
df = pd.get_dummies(df, columns=['Outlook'], prefix='', prefix_sep='')
df['Wind'] = df['Wind'].astype(int)

# Split features and target
X, y = df.drop('Num_Players', axis=1), df['Num_Players']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, shuffle=False)

Main Mechanism

Here’s how Gradient Boosting works:

  1. Initialize Model: Start with a simple prediction, typically the mean of target values.
  2. Iterative Learning: For a set number of iterations, compute the residuals, train a decision tree to predict these residuals, and add the new tree’s predictions (scaled by the learning rate) to the running total.
  3. Build Trees on Residuals: Each new tree focuses on the remaining errors from all previous iterations.
  4. Final Prediction: Sum up all tree contributions (scaled by the learning rate) and the initial prediction.
A Gradient Boosting Regressor starts with an average prediction and improves it through multiple trees, each one fixing the previous trees’ mistakes in small steps, until reaching the final prediction.
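The four steps above can be sketched from scratch in a few lines. This is a minimal illustration for squared-error regression, not scikit-learn’s internal implementation; the helper names `fit_gradient_boosting` and `predict_gradient_boosting` are made up for this example, and the arrays are the first 14 rows of the golf data (the training half used later in the article).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Training half of the golf data (first 14 rows), columns in the order
# Temp., Humid., Wind, overcast, rain, sunny
X_train = np.array([
    [85, 85, 0, 0, 0, 1], [80, 90, 1, 0, 0, 1], [83, 78, 0, 1, 0, 0],
    [70, 96, 0, 0, 1, 0], [68, 80, 0, 0, 1, 0], [65, 70, 1, 0, 1, 0],
    [64, 65, 1, 1, 0, 0], [72, 95, 0, 0, 0, 1], [69, 70, 0, 0, 0, 1],
    [75, 80, 0, 0, 1, 0], [75, 70, 1, 0, 0, 1], [72, 90, 1, 1, 0, 0],
    [81, 75, 0, 1, 0, 0], [71, 80, 1, 0, 1, 0],
], dtype=float)
y_train = np.array([52, 39, 43, 37, 28, 19, 43, 47, 56, 33, 49, 23, 42, 13], dtype=float)


def fit_gradient_boosting(X, y, n_trees=50, learning_rate=0.1, max_depth=3):
    """Hypothetical helper: start from the mean, then fit trees to residuals."""
    base = float(np.mean(y))                       # 1. initial prediction
    pred = np.full(len(y), base)
    trees = []
    for _ in range(n_trees):                       # 2. iterate
        residuals = y - pred                       # 3. errors of the current model
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residuals)
        pred = pred + learning_rate * tree.predict(X)
        trees.append(tree)
    return base, trees


def predict_gradient_boosting(X, base, trees, learning_rate=0.1):
    # 4. initial prediction plus every scaled tree contribution
    return base + learning_rate * sum(tree.predict(X) for tree in trees)


base, trees = fit_gradient_boosting(X_train, y_train)
train_pred = predict_gradient_boosting(X_train, base, trees)
train_rmse = float(np.sqrt(np.mean((y_train - train_pred) ** 2)))
print(f"initial prediction: {base:.2f}, training RMSE after {len(trees)} trees: {train_rmse:.3f}")
```

The same learning rate has to be used when fitting and when predicting, which is why a real implementation stores it on the model rather than passing it twice.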

Training Steps

We’ll follow the standard gradient boosting approach:

1.0. Set Model Parameters:
Before building any trees, we need to set the core parameters that control the learning process:
· the number of trees (typically 100, but we’ll choose 50) to build sequentially,
· the learning rate (typically 0.1), and
· the maximum depth of each tree (typically 3)

A tree diagram showing our key settings: each tree will have 3 levels, and we’ll create 50 of them while moving forward in small steps of 0.1.

For the First Tree

2.0. Make an initial prediction for the target. This is typically the mean of the training targets (just like a dummy regressor’s prediction).

To start our predictions, we use the average value (37.43) of all our training data as the first guess for every case.
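As a quick check of that starting value, the snippet below averages the 14 training targets (the first half of the `Num_Players` column, since the split is unshuffled). The variable names are only illustrative.

```python
import numpy as np

# Targets of the 14 training rows (first half of Num_Players)
y_train = np.array([52, 39, 43, 37, 28, 19, 43, 47, 56, 33, 49, 23, 42, 13], dtype=float)

initial_prediction = y_train.mean()
predictions = np.full_like(y_train, initial_prediction)  # same guess for every row

print(round(initial_prediction, 2))  # 37.43
```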

2.1. Calculate the residuals (also called pseudo-residuals):
residual = actual value - predicted value

Calculating the initial residuals by subtracting the mean prediction (37.43) from each target value in our training set.
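The name “pseudo-residuals” comes from the gradient view: for the loss ½·(actual - predicted)², the negative gradient with respect to the prediction is exactly the residual. The sketch below checks this numerically with a finite difference; the ½ factor and the helper `half_squared_error` are choices made for this illustration.

```python
import numpy as np

y_train = np.array([52, 39, 43, 37, 28, 19, 43, 47, 56, 33, 49, 23, 42, 13], dtype=float)
predictions = np.full_like(y_train, y_train.mean())

# Step 2.1: residual = actual value - predicted value
residuals = y_train - predictions


def half_squared_error(actual, predicted):
    # Illustrative loss; the 1/2 makes the gradient come out without a factor of 2
    return 0.5 * (actual - predicted) ** 2


# Central finite difference of the loss with respect to the prediction
eps = 1e-6
numeric_gradient = (
    half_squared_error(y_train, predictions + eps)
    - half_squared_error(y_train, predictions - eps)
) / (2 * eps)

print(np.allclose(-numeric_gradient, residuals, atol=1e-4))  # True
```

With other losses the negative gradient is no longer the plain residual, which is why the more general term exists.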

2.2. Build a decision tree to predict these residuals. The tree building steps are exactly the same as in the regression tree.

The first decision tree begins its training by searching for patterns in our features that can best predict the calculated residuals from our initial mean prediction.

a. Calculate initial MSE (Mean Squared Error) for the root node

Just like in regular regression trees, we calculate the Mean Squared Error (MSE), but this time we’re measuring the spread of residuals (around zero) instead of actual values (around their mean).
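A small sketch of that root-node calculation, assuming the same 14 training targets: because the residuals are centered on zero, their MSE around zero is the same number as the variance of the original targets.

```python
import numpy as np

y_train = np.array([52, 39, 43, 37, 28, 19, 43, 47, 56, 33, 49, 23, 42, 13], dtype=float)
residuals = y_train - y_train.mean()

# MSE of the residuals measured around zero (the root node of the first tree)
root_mse = float(np.mean(residuals ** 2))

print(round(root_mse, 2))                      # 151.53
print(np.isclose(root_mse, np.var(y_train)))   # True
```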

b. For each feature:
· Sort data by feature values

For each feature in our dataset, we sort its values and find potential split points between them, just as we would in a standard decision tree, to determine the best way to divide our residuals.

· For each possible split point:
·· Split samples into left and right groups
·· Calculate MSE for both groups
·· Calculate MSE reduction for this split

Similar to a regular regression tree, we evaluate each split by calculating the weighted MSE of both groups, but here we’re measuring how well the split groups similar residuals rather than similar target values.
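Here is one way to score a single candidate split, sketched for the one-hot `rain` column at threshold 0.5. The helpers `mse` and `mse_reduction` are hypothetical names for this example; each group’s MSE is measured around that group’s own mean, as in a regular regression tree.

```python
import numpy as np

y_train = np.array([52, 39, 43, 37, 28, 19, 43, 47, 56, 33, 49, 23, 42, 13], dtype=float)
rain = np.array([0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1])  # one-hot 'rain' column
residuals = y_train - y_train.mean()


def mse(values):
    # Spread of a group around its own mean
    return float(np.mean((values - values.mean()) ** 2))


def mse_reduction(feature, threshold, targets):
    left = targets[feature <= threshold]
    right = targets[feature > threshold]
    weighted = (len(left) * mse(left) + len(right) * mse(right)) / len(targets)
    return mse(targets) - weighted


print(round(mse_reduction(rain, 0.5, residuals), 2))  # 72.56
```

The larger this reduction, the better the split separates rows with similar residuals.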

c. Pick the split that gives the largest MSE reduction

The tree makes its first split using the “rain” feature at value 0.5, dividing samples into two groups based on their residuals — this first decision will be refined by additional splits at deeper levels.
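To see why `rain` wins, the search can be written out by brute force: try every midpoint between consecutive distinct values of every feature and keep the split with the largest MSE reduction. This is a simplified sketch (the function `best_split` is hypothetical, and scikit-learn’s actual splitter is more optimized), run on the 14 training rows.

```python
import numpy as np

feature_names = ["Temp.", "Humid.", "Wind", "overcast", "rain", "sunny"]
X_train = np.array([
    [85, 85, 0, 0, 0, 1], [80, 90, 1, 0, 0, 1], [83, 78, 0, 1, 0, 0],
    [70, 96, 0, 0, 1, 0], [68, 80, 0, 0, 1, 0], [65, 70, 1, 0, 1, 0],
    [64, 65, 1, 1, 0, 0], [72, 95, 0, 0, 0, 1], [69, 70, 0, 0, 0, 1],
    [75, 80, 0, 0, 1, 0], [75, 70, 1, 0, 0, 1], [72, 90, 1, 1, 0, 0],
    [81, 75, 0, 1, 0, 0], [71, 80, 1, 0, 1, 0],
], dtype=float)
y_train = np.array([52, 39, 43, 37, 28, 19, 43, 47, 56, 33, 49, 23, 42, 13], dtype=float)
residuals = y_train - y_train.mean()


def best_split(X, targets):
    """Return (feature index, threshold, MSE reduction) of the best single split."""
    best = (None, None, 0.0)
    for j in range(X.shape[1]):
        values = np.unique(X[:, j])                      # sorted distinct values
        for threshold in (values[:-1] + values[1:]) / 2:  # midpoints between them
            mask = X[:, j] <= threshold
            left, right = targets[mask], targets[~mask]
            weighted = (len(left) * left.var() + len(right) * right.var()) / len(targets)
            gain = targets.var() - weighted
            if gain > best[2]:
                best = (j, float(threshold), float(gain))
    return best


j, threshold, gain = best_split(X_train, residuals)
print(feature_names[j], threshold, round(gain, 2))  # rain 0.5 72.56
```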

d. Continue splitting until reaching maximum depth or minimum samples per leaf.

After three levels of splitting on different features, our first tree has created eight distinct groups, each with its own prediction for the residuals.

2.3. Calculate Leaf Values
For each leaf, find the mean of residuals.

Each leaf in our first tree contains an average of the residuals in that group — these values will be used to adjust and improve our initial mean prediction of 37.43.
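The leaf values can be inspected directly. The sketch below fits a standalone `DecisionTreeRegressor` on the initial residuals (an approximation of the first boosting stage, not the exact tree object inside `GradientBoostingRegressor`) and uses `apply` to find which leaf each training row lands in.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X_train = np.array([
    [85, 85, 0, 0, 0, 1], [80, 90, 1, 0, 0, 1], [83, 78, 0, 1, 0, 0],
    [70, 96, 0, 0, 1, 0], [68, 80, 0, 0, 1, 0], [65, 70, 1, 0, 1, 0],
    [64, 65, 1, 1, 0, 0], [72, 95, 0, 0, 0, 1], [69, 70, 0, 0, 0, 1],
    [75, 80, 0, 0, 1, 0], [75, 70, 1, 0, 0, 1], [72, 90, 1, 1, 0, 0],
    [81, 75, 0, 1, 0, 0], [71, 80, 1, 0, 1, 0],
], dtype=float)
y_train = np.array([52, 39, 43, 37, 28, 19, 43, 47, 56, 33, 49, 23, 42, 13], dtype=float)
residuals = y_train - y_train.mean()

tree = DecisionTreeRegressor(max_depth=3, random_state=42).fit(X_train, residuals)

leaf_ids = tree.apply(X_train)          # index of the leaf each row falls into
tree_output = tree.predict(X_train)     # value stored in that leaf

for leaf in np.unique(leaf_ids):
    members = leaf_ids == leaf
    print(f"leaf {leaf}: {members.sum()} samples, "
          f"mean residual {residuals[members].mean():.2f}")
```

For squared error, the value a leaf predicts is the mean of the residuals that reached it, so the printed means match what `tree.predict` returns for those rows.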

2.4. Update Predictions
· For each data point in the training dataset, determine which leaf it falls into based on the new tree.

Running our training data through the first tree, each sample follows its own path based on weather features to get its predicted residual value, which will help correct our initial prediction.

· Multiply the new tree’s predictions by the learning rate and add these scaled predictions to the current model’s predictions. This will be the updated prediction.

Our model updates its predictions by taking small steps: it adds just 10% (our learning rate of 0.1) of each predicted residual to our initial prediction of 37.43, creating slightly improved predictions.
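A sketch of this single update step, again with a standalone tree standing in for the first boosting stage: the prediction moves 10% of the way along each predicted residual, and the training error drops a little.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X_train = np.array([
    [85, 85, 0, 0, 0, 1], [80, 90, 1, 0, 0, 1], [83, 78, 0, 1, 0, 0],
    [70, 96, 0, 0, 1, 0], [68, 80, 0, 0, 1, 0], [65, 70, 1, 0, 1, 0],
    [64, 65, 1, 1, 0, 0], [72, 95, 0, 0, 0, 1], [69, 70, 0, 0, 0, 1],
    [75, 80, 0, 0, 1, 0], [75, 70, 1, 0, 0, 1], [72, 90, 1, 1, 0, 0],
    [81, 75, 0, 1, 0, 0], [71, 80, 1, 0, 1, 0],
], dtype=float)
y_train = np.array([52, 39, 43, 37, 28, 19, 43, 47, 56, 33, 49, 23, 42, 13], dtype=float)

learning_rate = 0.1
initial = np.full(len(y_train), y_train.mean())
residuals = y_train - initial

# First tree predicts the residuals of the mean-only model
tree = DecisionTreeRegressor(max_depth=3, random_state=42).fit(X_train, residuals)

# Step 2.4: add only a fraction of each predicted residual
updated = initial + learning_rate * tree.predict(X_train)
new_residuals = y_train - updated

mse_before = float(np.mean(residuals ** 2))
mse_after = float(np.mean(new_residuals ** 2))
print(f"training MSE before: {mse_before:.2f}, after one tree: {mse_after:.2f}")
```

The `new_residuals` array is what the second tree would be trained on.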

For the Second Tree

2.1. Calculate new residuals based on current model
a. Compute the difference between the target and current predictions.
These residuals will be a bit different from the first iteration.

After updating our predictions with the first tree, we calculate new residuals — notice how they’re slightly smaller than the original ones, showing our predictions are gradually improving.

2.2. Build a new tree to predict these residuals. Same process as first tree, but targeting new residuals.

Starting our second tree to predict the new, smaller residuals: we’ll use the same tree-building process as before, but now we’re trying to catch the errors our first tree missed.

2.3. Calculate the mean residuals for each leaf

The second tree follows an identical structure to our first tree with the same weather features and split points, but with smaller values in its leaves — showing we’re fine-tuning the remaining errors.

2.4. Update model predictions
· Multiply the new tree’s predictions by the learning rate.
· Add the new scaled tree predictions to the running total.

After running our data through the second tree, we again take small steps with our 0.1 learning rate to update predictions, and calculate new residuals that are even smaller than before — our model is gradually learning the patterns.

For the Third Tree onwards

Repeat Steps 2.1–2.4 for the remaining iterations. Note that each tree sees different residuals.
· Trees progressively focus on harder-to-predict patterns
· The learning rate helps limit overfitting by shrinking each tree’s contribution

As we build more trees, notice how the split points slowly shift and the residual values in the leaves get smaller — by tree 50, we’re making tiny adjustments using different combinations of features compared to our first trees.
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingRegressor

# Train the model (50 trees of depth 3, matching the settings chosen above)
clf = GradientBoostingRegressor(criterion='squared_error', n_estimators=50,
                                max_depth=3, learning_rate=0.1, random_state=42)
clf.fit(X_train, y_train)

# Plot trees 1, 3, 25, and 50
plt.figure(figsize=(11, 20), dpi=300)

for i, tree_idx in enumerate([0, 2, 24, 49]):
    plt.subplot(4, 1, i+1)
    # estimators_ has shape (n_estimators, 1) for regression
    plot_tree(clf.estimators_[tree_idx, 0],
              feature_names=list(X_train.columns),
              impurity=False,
              filled=True,
              rounded=True,
              precision=2,
              fontsize=12)
    plt.title(f'Tree {tree_idx + 1}')

plt.suptitle('Decision Trees from GradientBoosting', fontsize=16)
plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()

Visualization from scikit-learn shows how our gradient boosting trees evolve: from Tree 1 making large splits with big prediction values, to Tree 50 making refined splits with tiny adjustments — each tree focuses on correcting the remaining errors from previous trees.

Testing Step

For predicting:
a. Start with the initial prediction (the average number of players)
b. Run the input through each tree to get its predicted adjustment
c. Scale each tree’s prediction by the learning rate.
d. Add all these adjustments to the initial prediction
e. The sum directly gives us the predicted number of players

When predicting on unseen data, each tree contributes its small prediction, starting from 5.57 in Tree 1 down to 0.008 in Tree 50 — all these predictions are scaled by our 0.1 learning rate and added to our base prediction of 37.43 to get the final answer.
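The prediction steps can be reproduced by hand from a fitted scikit-learn model. The sketch below assumes squared-error loss and the default initial estimator, uses `init_` and `estimators_` from `GradientBoostingRegressor`, and checks the manual sum against `predict` for three rows taken from the test half of the golf data.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Columns: Temp., Humid., Wind, overcast, rain, sunny
X_train = np.array([
    [85, 85, 0, 0, 0, 1], [80, 90, 1, 0, 0, 1], [83, 78, 0, 1, 0, 0],
    [70, 96, 0, 0, 1, 0], [68, 80, 0, 0, 1, 0], [65, 70, 1, 0, 1, 0],
    [64, 65, 1, 1, 0, 0], [72, 95, 0, 0, 0, 1], [69, 70, 0, 0, 0, 1],
    [75, 80, 0, 0, 1, 0], [75, 70, 1, 0, 0, 1], [72, 90, 1, 1, 0, 0],
    [81, 75, 0, 1, 0, 0], [71, 80, 1, 0, 1, 0],
], dtype=float)
y_train = np.array([52, 39, 43, 37, 28, 19, 43, 47, 56, 33, 49, 23, 42, 13], dtype=float)

# Three unseen rows (rows 15-17 of the full dataset)
X_new = np.array([
    [81, 88, 1, 0, 0, 1],
    [74, 92, 0, 1, 0, 0],
    [76, 85, 0, 0, 1, 0],
], dtype=float)

gb = GradientBoostingRegressor(n_estimators=50, learning_rate=0.1,
                               max_depth=3, random_state=42)
gb.fit(X_train, y_train)

# a. initial prediction (the training mean)
base = np.asarray(gb.init_.predict(X_new)).ravel()
# b-d. run each tree, scale by the learning rate, add everything up
adjustments = sum(tree.predict(X_new) for tree in gb.estimators_[:, 0])
manual = base + gb.learning_rate * adjustments

print(np.allclose(manual, gb.predict(X_new)))  # True
```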

Evaluation Step

After building all the trees, we can evaluate the test set.

Our gradient boosting model achieves an RMSE of 4.785, quite an improvement over a single regression tree’s 5.27 — showing how combining many small corrections leads to better predictions than one complex tree!
# Get predictions
y_pred = clf.predict(X_test)

# Create DataFrame with actual and predicted values
results_df = pd.DataFrame({
'Actual': y_test,
'Predicted': y_pred
})
print(results_df) # Display results DataFrame

# Calculate and display RMSE (root_mean_squared_error requires scikit-learn >= 1.4)
from sklearn.metrics import root_mean_squared_error
rmse = root_mean_squared_error(y_test, y_pred)
print(f"\nRMSE: {rmse:.4f}")

Key Parameters

Here are the key parameters for Gradient Boosting, particularly in scikit-learn:

max_depth: The depth of trees used to model residuals. Unlike AdaBoost which uses stumps, Gradient Boosting works better with deeper trees (typically 3-8 levels). Deeper trees capture more complex patterns but risk overfitting.

n_estimators: The number of trees to be used (typically 100-1000). More trees usually improve performance when paired with a small learning rate.

learning_rate: Also called “shrinkage”, this scales each tree’s contribution (typically 0.01-0.1). Smaller values require more trees but often give better results by making the learning process more fine-grained.

subsample: The fraction of samples used to train each tree (typically 0.5-0.8). This optional feature adds randomness that can improve robustness and reduce overfitting.

These parameters work together: a small learning rate needs more trees, while deeper trees might need a smaller learning rate to avoid overfitting.
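One way to watch the interaction between `learning_rate` and `n_estimators` is `staged_predict`, which yields the model’s prediction after each added tree. The sketch below compares two learning rates on the training half only (the helper `training_rmse_by_stage` is hypothetical); training error says nothing about overfitting, so treat it purely as an illustration of how much slower a small learning rate converges.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

X_train = np.array([
    [85, 85, 0, 0, 0, 1], [80, 90, 1, 0, 0, 1], [83, 78, 0, 1, 0, 0],
    [70, 96, 0, 0, 1, 0], [68, 80, 0, 0, 1, 0], [65, 70, 1, 0, 1, 0],
    [64, 65, 1, 1, 0, 0], [72, 95, 0, 0, 0, 1], [69, 70, 0, 0, 0, 1],
    [75, 80, 0, 0, 1, 0], [75, 70, 1, 0, 0, 1], [72, 90, 1, 1, 0, 0],
    [81, 75, 0, 1, 0, 0], [71, 80, 1, 0, 1, 0],
], dtype=float)
y_train = np.array([52, 39, 43, 37, 28, 19, 43, 47, 56, 33, 49, 23, 42, 13], dtype=float)


def training_rmse_by_stage(learning_rate, n_estimators=50):
    """Training RMSE after 1, 2, ..., n_estimators trees."""
    model = GradientBoostingRegressor(n_estimators=n_estimators,
                                      learning_rate=learning_rate,
                                      max_depth=3, random_state=42)
    model.fit(X_train, y_train)
    return [float(np.sqrt(np.mean((y_train - stage_pred) ** 2)))
            for stage_pred in model.staged_predict(X_train)]


slow = training_rmse_by_stage(learning_rate=0.01)
fast = training_rmse_by_stage(learning_rate=0.1)

for n_trees in (10, 25, 50):
    print(f"{n_trees:>2} trees  lr=0.01: {slow[n_trees - 1]:.2f}   "
          f"lr=0.1: {fast[n_trees - 1]:.2f}")
```

In practice the same loop would be run against a validation set to pick the number of trees for a given learning rate.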

Key differences from AdaBoost

Both AdaBoost and Gradient Boosting are boosting algorithms, but they learn from their mistakes in different ways. Here are the key differences:

  1. max_depth is typically higher (3-8) in Gradient Boosting, while AdaBoost prefers stumps.
  2. No sample_weight updates because Gradient Boosting uses residuals instead of sample weighting.
  3. The learning_rate is typically much smaller (0.01-0.1) compared to AdaBoost’s larger values (0.1-1.0).
  4. Initial prediction starts from the mean while AdaBoost starts from zero.
  5. Trees are combined through simple addition rather than weighted voting, making each tree’s contribution more straightforward.
  6. Optional subsample parameter adds randomness, a feature not present in standard AdaBoost.

Pros:

  • Step-by-Step Error Fixing: In Gradient Boosting, each new tree focuses on correcting the mistakes made by the previous ones. This makes the model better at improving its predictions in areas where it was previously wrong.
  • Flexible Error Measures: Unlike AdaBoost, Gradient Boosting can optimize different types of error measurements (like mean absolute error, mean squared error, or others). This makes it adaptable to various kinds of problems.
  • High Accuracy: By using more detailed trees and carefully controlling the learning rate, Gradient Boosting often provides more accurate results than other algorithms, especially for well-structured data.

Cons:

  • Risk of Overfitting: The use of deeper trees and the sequential building process can cause the model to fit the training data too closely, which may reduce its performance on new data. This requires careful tuning of tree depth, learning rate, and the number of trees.
  • Slow Training Process: Like AdaBoost, trees must be built one after another, making it slower to train compared to algorithms that can build trees in parallel, like Random Forest. Each tree relies on the errors of the previous ones.
  • High Memory Use: The need for deeper and more numerous trees means Gradient Boosting can consume more memory than simpler boosting methods such as AdaBoost.
  • Sensitive to Settings: The effectiveness of Gradient Boosting heavily depends on finding the right combination of learning rate, tree depth, and number of trees, which can be more complex and time-consuming than tuning simpler algorithms.

Gradient Boosting is a major improvement in boosting algorithms. This success has led to popular versions like XGBoost and LightGBM, which are widely used in machine learning competitions and real-world applications.

While Gradient Boosting requires more careful tuning than simpler algorithms — especially when adjusting the depth of decision trees, the learning rate, and the number of trees — it is very flexible and powerful. This makes it a top choice for problems with structured data.

Gradient Boosting can handle complex relationships that simpler methods like AdaBoost might miss. Its continued popularity and ongoing improvements show that the approach of using gradients and building models step-by-step remains highly important in modern machine learning.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import root_mean_squared_error
from sklearn.ensemble import GradientBoostingRegressor

# Create dataset
dataset_dict = {
    'Outlook': ['sunny', 'sunny', 'overcast', 'rain', 'rain', 'rain', 'overcast',
                'sunny', 'sunny', 'rain', 'sunny', 'overcast', 'overcast', 'rain',
                'sunny', 'overcast', 'rain', 'sunny', 'sunny', 'rain', 'overcast',
                'rain', 'sunny', 'overcast', 'sunny', 'overcast', 'rain', 'overcast'],
    'Temp.': [85.0, 80.0, 83.0, 70.0, 68.0, 65.0, 64.0, 72.0, 69.0, 75.0, 75.0,
              72.0, 81.0, 71.0, 81.0, 74.0, 76.0, 78.0, 82.0, 67.0, 85.0, 73.0,
              88.0, 77.0, 79.0, 80.0, 66.0, 84.0],
    'Humid.': [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0,
               90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0,
               65.0, 70.0, 60.0, 95.0, 70.0, 78.0],
    'Wind': [False, True, False, False, False, True, True, False, False, False, True,
             True, False, True, True, False, False, True, False, True, True, False,
             True, False, False, True, False, False],
    'Num_Players': [52, 39, 43, 37, 28, 19, 43, 47, 56, 33, 49, 23, 42, 13, 33, 29,
                    25, 51, 41, 14, 34, 29, 49, 36, 57, 21, 23, 41]
}

# Prepare data
df = pd.DataFrame(dataset_dict)
df = pd.get_dummies(df, columns=['Outlook'], prefix='', prefix_sep='')
df['Wind'] = df['Wind'].astype(int)

# Split features and target
X, y = df.drop('Num_Players', axis=1), df['Num_Players']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, shuffle=False)

# Train Gradient Boosting
gb = GradientBoostingRegressor(
    n_estimators=50,     # Number of boosting stages (trees)
    learning_rate=0.1,   # Shrinks the contribution of each tree
    max_depth=3,         # Depth of each tree
    subsample=0.8,       # Fraction of samples used for each tree
    random_state=42
)
gb.fit(X_train, y_train)

# Predict and evaluate
y_pred = gb.predict(X_test)
rmse = root_mean_squared_error(y_test, y_pred)

print(f"Root Mean Squared Error: {rmse:.2f}")
