Boosting Over Bagging: Enhancing Predictive Accuracy with Gradient Boosting Regressors


Ensemble learning techniques primarily fall into two categories: bagging and boosting. Bagging improves stability and accuracy by aggregating independent predictions, whereas boosting sequentially corrects the errors of prior models, improving their performance with each iteration. This post begins our deep dive into boosting, starting with the Gradient Boosting Regressor. Through its application on the Ames Housing Dataset, we will demonstrate how boosting uniquely enhances models, setting the stage for exploring various boosting techniques in upcoming posts.

Let’s get started.

Photo by Erol Ahmed. Some rights reserved.

Overview

This post is divided into four parts; they are:

  • What is Boosting?
  • Comparing Model Performance: Decision Tree Baseline to Gradient Boosting Ensembles
  • Optimizing Gradient Boosting with Learning Rate Adjustments
  • Final Optimization: Tuning Learning Rate and Number of Trees

What is Boosting?

Boosting is an ensemble technique combining multiple models to create a strong learner. Unlike other ensemble methods that may build models in parallel, boosting adds models sequentially, with each new model focusing on improving the areas where previous models struggled. This methodically improves the ensemble’s accuracy with each iteration, making it particularly effective for complex datasets.

Key Features of Boosting:

  • Sequential Learning: Boosting builds one model at a time. Each new model learns from the shortcomings of the previous ones, allowing for progressive improvement in capturing data complexities.
  • Error Correction: New learners focus on previously mispredicted instances, continuously enhancing the ensemble’s capability to capture difficult patterns in the data.
  • Model Complexity: The ensemble’s complexity grows as more models are added, enabling it to capture intricate data structures effectively.
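To make these three features concrete, here is a minimal hand-rolled sketch of boosting for regression. It is not how scikit-learn implements it internally. It assumes a squared-error objective, uses synthetic data from make_regression as a stand-in for a real dataset, and the names learning_rate, n_rounds, and mse_per_round are illustrative choices of ours.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption; the post itself uses the Ames Housing Dataset).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

learning_rate = 0.1  # fraction of each tree's correction that is kept
n_rounds = 50        # number of sequential weak learners

# Start from a constant prediction: the mean of the target.
prediction = np.full(y.shape, y.mean())
mse_per_round = [np.mean((y - prediction) ** 2)]

for _ in range(n_rounds):
    residuals = y - prediction                     # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=3, random_state=42)
    tree.fit(X, residuals)                         # the new learner targets those errors
    prediction += learning_rate * tree.predict(X)  # add a damped correction
    mse_per_round.append(np.mean((y - prediction) ** 2))

print(f"Training MSE before boosting: {mse_per_round[0]:.1f}")
print(f"Training MSE after {n_rounds} rounds: {mse_per_round[-1]:.1f}")
```

Each pass through the loop is one round of sequential learning: the residuals carry the error-correction signal, and the growing sum of trees is the increasing model complexity described above.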

Boosting vs. Bagging

Bagging involves building several models (often independently) and combining their outputs to enhance the ensemble’s overall performance, primarily by reducing the risk of overfitting the noise in the training data. In contrast, boosting focuses on improving the accuracy of predictions by learning from errors sequentially, which allows it to adapt more closely to the data.

Boosting Regressors in scikit-learn:

Scikit-learn provides several implementations of boosting, tailored for different needs and data scenarios:

  • AdaBoost Regressor: Fits a sequence of weak learners, reweighting the training instances so that each new learner concentrates on the samples its predecessors predicted poorly.
  • Gradient Boosting Regressor: Builds models one at a time, with each new model trained to correct the residuals (errors) made by the previous ones, improving accuracy through careful adjustments.
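  • Histogram-based Gradient Boosting Regressor (HistGradientBoostingRegressor): An optimized form of Gradient Boosting designed for larger datasets, which speeds up training by binning continuous feature values into histograms so that candidate splits are evaluated over a small number of bins.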

Each method utilizes the core principles of boosting to improve its components’ performance, showcasing the versatility and power of this approach in tackling predictive modeling challenges. In the following sections of this post, we will demonstrate a practical application of the Gradient Boosting Regressor using the Ames Housing Dataset.

Comparing Model Performance: Decision Tree Baseline to Gradient Boosting Ensembles

In transitioning from the theoretical aspects of boosting to its practical applications, this section will demonstrate the Gradient Boosting Regressor using the meticulously preprocessed Ames Housing Dataset. Our preprocessing steps, consistent across various tree-based models, ensure that the improvements observed can be attributed directly to the model’s capabilities, setting the stage for an effective comparison.

Our comparative analysis first sets up a baseline using a single Decision Tree, which is not an ensemble method. This baseline lets us clearly illustrate the incremental benefits of actual ensemble methods. We then configure two versions each of Bagging, Random Forest, and the Gradient Boosting Regressor, one with 100 trees and one with 200, to explore the improvements these ensemble techniques offer over the baseline.
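A comparison of this shape can be sketched as follows. This is not the exact code behind the numbers in this post: it skips the Ames preprocessing pipeline and substitutes synthetic data from make_regression, so its scores will differ. The model labels and the results dictionary are names we chose for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (
    BaggingRegressor,
    GradientBoostingRegressor,
    RandomForestRegressor,
)
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption; the post uses the preprocessed Ames data).
X, y = make_regression(n_samples=200, n_features=8, noise=15.0, random_state=42)

# One non-ensemble baseline, then each ensemble with 100 and 200 trees.
models = {
    "Decision Tree (baseline)": DecisionTreeRegressor(random_state=42),
    "Bagging (100 trees)": BaggingRegressor(n_estimators=100, random_state=42),
    "Bagging (200 trees)": BaggingRegressor(n_estimators=200, random_state=42),
    "Random Forest (100 trees)": RandomForestRegressor(n_estimators=100, random_state=42),
    "Random Forest (200 trees)": RandomForestRegressor(n_estimators=200, random_state=42),
    "Gradient Boosting (100 trees)": GradientBoostingRegressor(n_estimators=100, random_state=42),
    "Gradient Boosting (200 trees)": GradientBoostingRegressor(n_estimators=200, random_state=42),
}

# Mean R^2 from 5-fold cross-validation for every model.
results = {}
for name, model in models.items():
    results[name] = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"{name}: mean R^2 = {results[name]:.4f}")
```

BaggingRegressor is left with its default base estimator, which is a decision tree, so all three ensembles are tree-based and differ mainly in how the trees are built and combined.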

Below are the cross-validation results, showcasing how each model performs in terms of mean R² values:

The results from our ensemble models underline several key insights into the behavior and performance of advanced regression techniques:

  • Baseline and Enhancement: Starting with a basic Decision Tree Regressor, which serves as our baseline with an R² of 0.7663, we observe significant performance uplifts as we introduce more complex models. Both Bagging and Random Forest Regressors, using different numbers of trees, show improved scores, illustrating the power of ensemble methods in leveraging multiple learning models to reduce error.
  • Gradient Boosting Regressor’s Edge: Particularly notable is the Gradient Boosting Regressor. With its default setting of 100 trees, it achieves an R² of 0.9027, and further increasing the number of trees to 200 nudges the score up to 0.9061. This indicates the effectiveness of GBR in this context and highlights its efficiency in sequential improvement from additional learners.
  • Marginal Gains from More Trees: While increasing the number of trees generally results in better performance, the incremental gains diminish as we expand the ensemble size. This trend is evident across Bagging, Random Forest, and Gradient Boosting models, suggesting a point of diminishing returns where additional computational resources yield minimal performance improvements.

The results highlight the Gradient Boosting Regressor’s robust performance. It effectively leverages comprehensive preprocessing and the sequential improvement strategy characteristic of boosting. Next, we will explore how adjusting the learning rate can refine our model’s performance, enhancing its predictive accuracy.

Optimizing Gradient Boosting with Learning Rate Adjustments

The learning_rate is unique to boosting models like the Gradient Boosting Regressor, distinguishing it from other models such as Decision Trees and Random Forests, which do not have a direct equivalent of this parameter. Adjusting the learning_rate allows us to delve deeper into the mechanics of boosting and enhance our model’s predictive power by fine-tuning how aggressively it learns from each successive tree.

What is the Learning Rate?

In the context of Gradient Boosting Regressors and other gradient descent-based algorithms, the “learning rate” is a crucial hyperparameter that controls the speed at which the model learns. At its core, the learning rate influences the size of the steps the model takes toward the optimal solution during training. Here’s a breakdown:

  • Size of Steps: The learning rate scales the contribution of each new tree to the ensemble’s prediction. A higher learning rate applies larger corrections, allowing the model to learn faster but at the risk of overshooting the optimal solution. Conversely, a lower learning rate applies smaller corrections, which means the model learns more slowly but with potentially higher precision.
  • Impact on Model Training:
    • Convergence: A learning rate that is too high may cause the training process to converge too quickly to a suboptimal solution, or it might not converge at all as it overshoots the minimum.
    • Accuracy and Overfitting: A learning rate that is too low makes the model learn slowly, so it needs more trees to reach similar accuracy, which lengthens training; with too few trees it may underfit. The number of trees should therefore be tuned together with the learning rate and monitored for overfitting.
  • Tuning: Choosing the right learning rate balances speed and accuracy. It is often selected through trial and error or more systematic approaches like GridSearchCV and RandomizedSearchCV, as adjusting the learning rate can significantly affect the model’s performance and training time.

By adjusting the learning rate, data scientists can control how quickly a boosting model adapts to the complexity of its errors. This makes the learning rate a powerful tool in fine-tuning model performance, especially in boosting algorithms where each new tree is built to correct the residuals (errors) left by the previous trees.
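The effect of the step size can be observed directly. The sketch below fits two Gradient Boosting Regressors that differ only in learning_rate and records the training error after every added tree using staged_predict. It runs on synthetic data (an assumption, not the Ames Housing Dataset), and it tracks training error only, so it shows how fast each model learns rather than how well it generalizes.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# Synthetic stand-in data (an assumption; not the Ames Housing Dataset).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

train_mse = {}
for lr in (0.01, 0.1):
    gbr = GradientBoostingRegressor(n_estimators=100, learning_rate=lr, random_state=42)
    gbr.fit(X, y)
    # staged_predict yields the ensemble's prediction after each successive tree.
    train_mse[lr] = [mean_squared_error(y, pred) for pred in gbr.staged_predict(X)]

for lr, curve in train_mse.items():
    print(
        f"learning_rate={lr}: training MSE after 10 trees = {curve[9]:.1f}, "
        f"after 100 trees = {curve[-1]:.1f}"
    )
```

With the same budget of 100 trees, the smaller learning rate leaves far more training error uncorrected, which is why learning_rate and n_estimators are usually tuned together.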

To optimize the learning_rate, we start with GridSearchCV, a systematic method that will explore predefined values ([0.001, 0.01, 0.1, 0.2, 0.3]) to ascertain the most effective setting for enhancing our model’s accuracy.
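A grid search over these five values might look like the sketch below. The candidate list comes from the text above; everything else is an assumption made for illustration: synthetic data in place of the preprocessed Ames data, n_estimators fixed at 200, 5-fold cross-validation, and R² as the scoring metric.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption; not the Ames Housing Dataset).
X, y = make_regression(n_samples=200, n_features=8, noise=15.0, random_state=42)

# Candidate learning rates listed in the post.
param_grid = {"learning_rate": [0.001, 0.01, 0.1, 0.2, 0.3]}

grid_search = GridSearchCV(
    estimator=GradientBoostingRegressor(n_estimators=200, random_state=42),  # 200 trees is assumed
    param_grid=param_grid,
    cv=5,
    scoring="r2",
)
grid_search.fit(X, y)

print("Best learning_rate:", grid_search.best_params_["learning_rate"])
print(f"Best mean R^2: {grid_search.best_score_:.4f}")
```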

Here are the results from our GridSearchCV, focused solely on optimizing the learning_rate parameter:

Using GridSearchCV, we found that a learning_rate of 0.1 yielded the best result, matching the default setting. Among the values tested, neither a higher nor a lower rate improved the model for our dataset and preprocessing setup.

Following this, we utilize RandomizedSearchCV to expand our search. Unlike GridSearchCV, RandomizedSearchCV randomly selects from a continuous range, allowing for a potentially more precise optimization by exploring between the standard values, thus providing a comprehensive understanding of how subtle variations in learning_rate can impact performance.
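A randomized search over a continuous range can be sketched as follows. The bounds of the range and the number of sampled candidates are not stated in this post, so the values here are assumptions: we sample learning_rate uniformly from 0.001 to 0.3 (the endpoints of the earlier grid), draw 10 candidates, keep n_estimators at 200, and use synthetic data.

```python
from scipy.stats import uniform
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (an assumption; not the Ames Housing Dataset).
X, y = make_regression(n_samples=200, n_features=8, noise=15.0, random_state=42)

# uniform(loc, scale) samples from [loc, loc + scale], here [0.001, 0.3] (assumed range).
param_dist = {"learning_rate": uniform(loc=0.001, scale=0.299)}

random_search = RandomizedSearchCV(
    estimator=GradientBoostingRegressor(n_estimators=200, random_state=42),
    param_distributions=param_dist,
    n_iter=10,        # number of sampled candidates (assumed)
    cv=5,
    scoring="r2",
    random_state=42,  # makes the sampled candidates reproducible
)
random_search.fit(X, y)

print(f"Best learning_rate: {random_search.best_params_['learning_rate']:.3f}")
print(f"Best mean R^2: {random_search.best_score_:.4f}")
```

Because the candidates are drawn from a distribution, the search can land on values such as 0.158 that a coarse grid would never test.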

Contrasting with GridSearchCV, RandomizedSearchCV identified a slightly different optimal learning_rate of approximately 0.158, which enhanced our model’s performance. This improvement underscores the value of a randomized search, particularly when fine-tuning models, as it can explore a more diverse set of possibilities and potentially yield better configurations.

The optimization through RandomizedSearchCV has demonstrated its efficacy by pinpointing a learning rate that pushes our model’s performance to new heights, achieving an R² score of 0.9134. These experiments with learning_rate adjustments through GridSearchCV and RandomizedSearchCV illustrate the delicate balance required in tuning gradient boosting models. They also highlight the benefits of exploring both systematic and randomized parameter search strategies to optimize a model fully.

Encouraged by the gains achieved through these optimization strategies, we will now extend our focus to fine-tuning both the learning_rate and n_estimators simultaneously. This next phase aims to uncover even more optimal settings by exploring the combined impact of these crucial parameters on our Gradient Boosting Regressor’s performance.

Final Optimization: Tuning Learning Rate and Number of Trees

Building on our previous findings, we now advance to a more comprehensive optimization approach that involves simultaneously tuning both learning_rate and n_estimators. This dual-parameter tuning is designed to explore how these parameters work together, potentially enhancing the performance of the Gradient Boosting Regressor even further.

We begin with GridSearchCV to systematically explore combinations of learning_rate and n_estimators. This approach provides a structured way to assess the impact of varying both parameters on our model’s accuracy.
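The sketch below shows one way to set up such a two-parameter grid. The post reports 25 combinations with 500 trees as the best setting, but it does not list the grid itself, so the specific values are assumptions: the five learning rates from the earlier search paired with five tree counts ending at 500. The data is synthetic and kept small so the search finishes quickly.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, ParameterGrid

# Small synthetic stand-in data (an assumption; not the Ames Housing Dataset).
X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=42)

# Assumed grid: 5 learning rates x 5 tree counts = 25 combinations.
param_grid = {
    "learning_rate": [0.001, 0.01, 0.1, 0.2, 0.3],
    "n_estimators": [50, 100, 200, 300, 500],
}
n_combinations = len(ParameterGrid(param_grid))
n_folds = 5
print(f"{n_combinations} combinations x {n_folds} folds = {n_combinations * n_folds} fits")

grid_search = GridSearchCV(
    estimator=GradientBoostingRegressor(random_state=42),
    param_grid=param_grid,
    cv=n_folds,
    scoring="r2",
)
grid_search.fit(X, y)

print("Best parameters:", grid_search.best_params_)
print(f"Best mean R^2: {grid_search.best_score_:.4f}")
```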

The GridSearchCV process evaluated 25 different combinations across 5 folds, totaling 125 fits:

It confirmed that a learning_rate of 0.1 (the default setting) remains effective. However, it suggested that increasing to 500 trees could slightly improve our model’s performance, raising the R² score to 0.9089. This is a modest gain over the R² of 0.9061 achieved earlier with 200 trees and a learning_rate of 0.1. Interestingly, our previous randomized search yielded an even better result of 0.9134 with only 200 trees and a learning_rate of approximately 0.158, illustrating the potential benefits of exploring a broader parameter space to optimize performance.

To explore the parameter space more thoroughly and potentially uncover even better configurations, we’ll now employ RandomizedSearchCV. This method takes a more exploratory, less rigid approach by sampling parameter values from distributions rather than from a fixed grid.
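A randomized search over both parameters can be sketched as below. The sampling ranges are assumptions, since this post does not state them: learning_rate is drawn uniformly from 0.001 to 0.3 and n_estimators from the integers 50 to 500. The helper run_random_search is a hypothetical name of ours. Its default of 50 candidates mirrors the 50 configurations mentioned in this post, while the demo call uses only 8 candidates on small synthetic data so that it runs quickly.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic stand-in data (an assumption; not the Ames Housing Dataset).
X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=42)

# Assumed sampling ranges for the two hyperparameters.
param_dist = {
    "learning_rate": uniform(loc=0.001, scale=0.299),  # continuous, [0.001, 0.3]
    "n_estimators": randint(50, 501),                  # integers 50..500 inclusive
}


def run_random_search(X, y, n_iter=50):
    """Hypothetical helper: sample n_iter configurations, score each with 5-fold CV."""
    search = RandomizedSearchCV(
        estimator=GradientBoostingRegressor(random_state=42),
        param_distributions=param_dist,
        n_iter=n_iter,
        cv=5,
        scoring="r2",
        random_state=42,
    )
    return search.fit(X, y)


# Reduced to 8 candidates (40 fits) to keep this sketch fast.
random_search = run_random_search(X, y, n_iter=8)
print("Best parameters:", random_search.best_params_)
print(f"Best mean R^2: {random_search.best_score_:.4f}")
```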

The RandomizedSearchCV extended our search across a broader range of possibilities, testing 50 different configurations across 5 folds, totaling 250 fits:

It identified an even more effective setting with a learning_rate of approximately 0.121 and n_estimators at 287, achieving our best R² score yet at 0.9158. This underscores the potential of randomized parameter tuning to discover optimal settings that more rigid methods might miss.

To validate the performance improvements achieved through our tuning efforts, we will now perform a final cross-validation using the Gradient Boosting Regressor configured with the best parameters identified: n_estimators set to 287 and a learning_rate of approximately 0.121.
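A final check of this kind can be sketched as follows. The two parameter values come from the search result reported above, with learning_rate rounded to 0.121. The data is synthetic (an assumption, not the preprocessed Ames data), so the score printed here will not match the one reported in this post.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption; not the Ames Housing Dataset).
X, y = make_regression(n_samples=200, n_features=8, noise=15.0, random_state=42)

# Best parameters reported in the post; 0.121 is the rounded learning rate.
tuned_gbr = GradientBoostingRegressor(
    n_estimators=287,
    learning_rate=0.121,
    random_state=42,
)

# 5-fold cross-validation with R^2 as the metric.
cv_scores = cross_val_score(tuned_gbr, X, y, cv=5, scoring="r2")
print(f"Mean R^2 across {len(cv_scores)} folds: {cv_scores.mean():.4f}")
```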

The final output confirms the performance of our tuned Gradient Boosting Regressor.

By optimizing both learning_rate and n_estimators, we have achieved an R² score of 0.9158. This score not only validates the enhancements made through parameter tuning but also emphasizes the capability of the Gradient Boosting Regressor to adapt and perform consistently across the dataset.

Summary

This post explored the capabilities of the Gradient Boosting Regressor (GBR), from understanding the foundational concepts of boosting to advanced optimization techniques using the Ames Housing Dataset. It focused on key parameters of the GBR such as the number of trees and learning rate, essential for refining the model’s accuracy and efficiency. Through systematic and randomized approaches, it demonstrated how to fine-tune these parameters using GridSearchCV and RandomizedSearchCV, enhancing the model’s performance significantly.

Specifically, you learned:

  • The fundamentals of boosting and how it differs from other ensemble techniques like bagging.
  • How to achieve incremental improvements by experimenting with a range of models.
  • Techniques for tuning learning rate and number of trees for the Gradient Boosting Regressor.

Do you have any questions? Please ask your questions in the comments below, and I will do my best to answer.
