Creating Powerful Ensemble Models with PyCaret
Machine learning is changing how we solve problems. However, no single model is perfect. Models can struggle with overfitting, underfitting, or bias, reducing prediction accuracy. Ensemble learning solves this by combining predictions from multiple models, using the strengths of each model while reducing weaknesses. This results in more accurate and reliable predictions.
PyCaret helps simplify ensemble model building with a user-friendly interface, handling data preprocessing, model creation, tuning, and evaluation. PyCaret allows easy creation, comparison, and optimization of ensemble models, and makes machine learning accessible to nearly everyone.
In this article, we will explore how to create ensemble models with PyCaret.
Why Use Ensemble Models?
As stated, machine learning models can overfit, underfit, or make biased predictions. Ensemble models address these problems by combining multiple models. Benefits of ensembling include:
- Improved Accuracy: Combining predictions from multiple models generally yields better results than using a single model
- Reduced Overfitting: Ensemble models can generalize better by reducing the impact of outlier predictions from individual models
- Increased Robustness: Aggregating diverse models makes predictions more stable and reliable
Types of Ensemble Techniques
Ensemble techniques combine multiple models to overcome the drawbacks of single models. The main ensemble techniques are bagging, boosting, stacking, and voting/averaging.
Bagging (Bootstrap Aggregating)
Bagging reduces variance by training multiple models on different data subsets. These subsets are created by random sampling with replacement. Each model is trained independently, and predictions are combined by averaging (for regression) or voting (for classification). Bagging helps reduce overfitting and makes predictions more stable. Random Forest is a type of bagging applied to decision trees.
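To make the mechanics concrete, here is a minimal conceptual sketch of bagging outside PyCaret, using scikit-learn decision trees on a synthetic dataset (the sample count and number of trees are arbitrary):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
rng = np.random.default_rng(0)

# Train each tree on a bootstrap sample (random sampling with replacement)
trees = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# Combine predictions by majority vote across the 25 trees
votes = np.stack([tree.predict(X) for tree in trees])
bagged_pred = (votes.mean(axis=0) > 0.5).astype(int)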
Boosting
Boosting primarily reduces bias by training models in sequence, where each new model learns from the mistakes of the previous one. Misclassified points receive higher weights so that later models focus on them. Boosting combines weak learners, such as shallow decision trees, into a strong one. It works well on complex datasets but needs careful tuning. Popular algorithms include AdaBoost, XGBoost, and LightGBM.
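As a conceptual illustration outside PyCaret, scikit-learn's AdaBoostClassifier implements this reweighting scheme over shallow trees (the dataset and estimator count below are arbitrary):

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Each new weak learner upweights the samples the previous ones misclassified
booster = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)
print(booster.score(X, y))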
Stacking
Stacking combines different types of models to leverage their strengths: a meta-model is trained on the predictions of the base models to make the final prediction. The meta-model learns how to combine the base models' predictions for better accuracy. Stacking can capture diverse patterns but is computationally intensive and needs careful validation to avoid overfitting.
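For intuition, here is a minimal stacking sketch using scikit-learn's StackingClassifier; the choice of base models and meta-model is purely illustrative:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, random_state=0)

# The meta-model (final_estimator) learns how to weight the base models' predictions
stack = StackingClassifier(
    estimators=[('rf', RandomForestClassifier()), ('knn', KNeighborsClassifier())],
    final_estimator=LogisticRegression(),
).fit(X, y)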
Voting and Averaging
Voting and averaging combine predictions from multiple models without a meta-model. In voting (for classification), predictions are combined by majority rule (hard voting) or by averaging probabilities (soft voting). In averaging (for regression), model predictions are averaged. These methods are simple to implement and work well when base models are strong and diverse, and are often used as baseline ensemble techniques.
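A minimal voting sketch with scikit-learn's VotingClassifier shows the hard/soft distinction (the base models here are chosen for illustration only):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=0)

# Soft voting averages predicted probabilities; voting='hard' would count class votes
voter = VotingClassifier(
    estimators=[('lr', LogisticRegression(max_iter=1000)), ('rf', RandomForestClassifier())],
    voting='soft',
).fit(X, y)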
Install PyCaret
First install PyCaret using pip:
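pip install pycaret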
Preparing the Data
For this tutorial, we will use the popular Pima Indians Diabetes dataset for classification.
from pycaret.datasets import get_data
from pycaret.classification import *

# Load the dataset
data = get_data('diabetes')

# Split the dataset into training and testing sets
from sklearn.model_selection import train_test_split
train, test = train_test_split(data, test_size=0.2, random_state=123)
Setting Up the Environment
The setup() function initializes the PyCaret environment by performing data preprocessing tasks like handling missing values, scaling, and encoding.
# Initialize the PyCaret environment
exp = setup(data=train, target='Class variable', session_id=123)
Some of the important setup parameters include:
- data: the training dataset
- target: the name of the target column
- session_id: sets the random seed for reproducibility
Comparing Base Models
PyCaret allows you to compare multiple base models and select the best candidates for ensemble modeling.
# Compare models and rank them based on performance
best_models = compare_models(n_select=3)
Here’s what’s going on:
- compare_models() evaluates all available models and ranks them based on default metrics like accuracy or AUC
- n_select=3 selects the top 3 models for further use
Creating Bagging and Boosting Models
You can create a bagging ensemble using PyCaret’s create_model() function:
# Create a Random Forest model
rf_model = create_model('rf')
Boosting models can be created in a similar way:
# Create a Gradient Boosting model
gb_model = create_model('gbc')
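PyCaret also provides a generic ensemble_model() function that wraps an already-created model with bagging or boosting; a brief sketch (the decision tree base model is illustrative):

# Wrap a single decision tree with bagging or boosting
dt_model = create_model('dt')
bagged_dt = ensemble_model(dt_model, method='Bagging')
boosted_dt = ensemble_model(dt_model, method='Boosting')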
Creating a Stacking Ensemble
Stacking ensembles combine predictions from multiple models using a meta-model, and can be created with a single function call:
# Create a Stacking ensemble using top 3 models
stacked_model = stack_models(best_models)
Here, stack_models() combines the predictions from the models in best_models using a meta-model — the default is logistic regression for classification.
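If you prefer a different meta-model, stack_models() accepts a meta_model argument; the gradient boosting choice below is purely illustrative:

# Use a custom meta-model instead of the default logistic regression
stacked_gbc = stack_models(best_models, meta_model=create_model('gbc'))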
Creating a Voting Ensemble
Voting aggregates predictions by majority voting (classification) or averaging (regression).
# Create a Voting ensemble using top 3 models
voting_model = blend_models(best_models)
In the above, blend_models() automatically combines the predictions of the selected models into a single ensemble.
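By default, blend_models() picks a voting scheme automatically; you can request soft or hard voting explicitly via the method parameter:

# Soft voting averages the models' predicted probabilities
soft_voting_model = blend_models(best_models, method='soft')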
Evaluating Models
You can evaluate ensemble models using the evaluate_model() function, which provides various visualizations such as the ROC curve, precision-recall curve, and confusion matrix. Here, let's evaluate the stacked model and view the confusion matrix.
# Evaluate the stacked model
evaluate_model(stacked_model)
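evaluate_model() opens an interactive widget; to render a specific plot directly, or to score the held-out test split created earlier, you can use plot_model() and predict_model():

# Render the confusion matrix directly
plot_model(stacked_model, plot='confusion_matrix')

# Generate predictions on the held-out test set
predictions = predict_model(stacked_model, data=test)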
Best Practices for Ensemble Modeling
For the best shot at high-quality results, keep the following best practices in mind when creating your ensemble models.
- Ensure Model Diversity: Use different model types and vary hyperparameters to increase diversity
- Limit Model Complexity: Avoid overly complex models to prevent overfitting and use regularization techniques
- Monitor Ensemble Size: Avoid unnecessary models and ensure that adding more models improves performance
- Handle Class Imbalance: Address class imbalance using techniques like oversampling or weighted loss functions
- Ensemble Model Fusion: Combine different ensemble methods (e.g., stacking and bagging) for better results
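Before deployment, you will typically tune the strongest base models and persist the final ensemble. A minimal sketch using PyCaret's tune_model(), finalize_model(), and save_model() (the file name is arbitrary):

# Tune a base model's hyperparameters (can be done before ensembling)
tuned_rf = tune_model(rf_model)

# Retrain the chosen ensemble on the full dataset and save it to disk
final_model = finalize_model(stacked_model)
save_model(final_model, 'diabetes_stacked_ensemble')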
Conclusion
Ensemble models improve machine learning performance by combining multiple models, and PyCaret simplifies this process with easy-to-use functions. You can create bagging, boosting, stacking, and voting ensembles effortlessly with the library, which also supports hyperparameter tuning for better results. Evaluate your models to choose the best one, and then save your ensemble models for future use or deployment. When following best practices, ensemble learning combined with PyCaret can help you build powerful models quickly and efficiently.