Creating Powerful Ensemble Models with PyCaret
Machine learning is changing how we solve problems. However, no single model is perfect. Models can struggle with overfitting, underfitting, or bias, reducing prediction accuracy. Ensemble learning solves this by combining predictions from multiple models, using the strengths of each model while reducing weaknesses. This results in more accurate and reliable predictions.
PyCaret helps simplify ensemble model building with a user-friendly interface, handling data preprocessing, model creation, tuning, and evaluation. PyCaret allows easy creation, comparison, and optimization of ensemble models, and makes machine learning accessible to nearly everyone.
In this article, we will explore how to create ensemble models with PyCaret.
Why Use Ensemble Models?
As stated, machine learning models can overfit, underfit, or make biased predictions. Ensemble models address these problems by combining multiple models. Benefits of ensembling include:
- Improved Accuracy: Combining predictions from multiple models generally yields better results than using a single model
- Reduced Overfitting: Ensemble models can generalize better by reducing the impact of outlier predictions from individual models
- Increased Robustness: Aggregating diverse models makes predictions more stable and reliable
Types of Ensemble Techniques
Ensemble techniques combine multiple models to overcome the drawbacks of single models. The main ensemble techniques are bagging, boosting, stacking, and voting/averaging.
Bagging (Bootstrap Aggregating)
Bagging reduces variance by training multiple models on different data subsets. These subsets are created by random sampling with replacement. Each model is trained independently, and predictions are combined by averaging (for regression) or voting (for classification). Bagging helps reduce overfitting and makes predictions more stable. Random Forest is a type of bagging applied to decision trees.
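To make the mechanics concrete, here is a minimal conceptual sketch of bagging outside PyCaret, using scikit-learn decision trees on a synthetic dataset (the sample count and number of trees are arbitrary):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
rng = np.random.default_rng(0)

# Train each tree on a bootstrap sample (random sampling with replacement)
trees = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# Combine predictions by majority vote across the 25 trees
votes = np.stack([tree.predict(X) for tree in trees])
bagged_pred = (votes.mean(axis=0) > 0.5).astype(int)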
Boosting
Boosting primarily reduces bias by training models in sequence, where each new model learns from the mistakes of the previous one. Misclassified points receive higher weights so that later models focus on them. Boosting combines weak learners, such as shallow decision trees, into a strong one. It works well on complex datasets but needs careful tuning. Popular algorithms include AdaBoost, XGBoost, and LightGBM.
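As a conceptual illustration outside PyCaret, scikit-learn's AdaBoostClassifier implements this reweighting scheme over shallow trees (the dataset and estimator count below are arbitrary):

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Each new weak learner upweights the samples the previous ones misclassified
booster = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)
print(booster.score(X, y))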
Stacking
Stacking combines different types of models to leverage their strengths: a meta-model is trained on the predictions of the base models to make the final prediction. The meta-model learns how to combine the base models' predictions for better accuracy. Stacking can capture diverse patterns but is computationally intensive and needs careful validation to avoid overfitting.
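For intuition, here is a minimal stacking sketch using scikit-learn's StackingClassifier; the choice of base models and meta-model is purely illustrative:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, random_state=0)

# The meta-model (final_estimator) learns how to weight the base models' predictions
stack = StackingClassifier(
    estimators=[('rf', RandomForestClassifier()), ('knn', KNeighborsClassifier())],
    final_estimator=LogisticRegression(),
).fit(X, y)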
Voting and Averaging
Voting and averaging combine predictions from multiple models without a meta-model. In voting (for classification), predictions are combined by majority rule (hard voting) or by averaging probabilities (soft voting). In averaging (for regression), model predictions are averaged. These methods are simple to implement and work well when base models are strong and diverse, and are often used as baseline ensemble techniques.
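A minimal voting sketch with scikit-learn's VotingClassifier shows the hard/soft distinction (the base models here are chosen for illustration only):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=0)

# Soft voting averages predicted probabilities; voting='hard' would count class votes
voter = VotingClassifier(
    estimators=[('lr', LogisticRegression(max_iter=1000)), ('rf', RandomForestClassifier())],
    voting='soft',
).fit(X, y)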
Install PyCaret
First install PyCaret using pip:
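pip install pycaret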
Preparing the Data
For this tutorial, we will use the popular Pima Indians Diabetes dataset for classification.
from pycaret.datasets import get_data
from pycaret.classification import *

# Load the dataset
data = get_data('diabetes')

# Split the dataset into training and testing sets
from sklearn.model_selection import train_test_split
train, test = train_test_split(data, test_size=0.2, random_state=123)
Setting Up the Environment
The setup() function initializes the PyCaret environment by performing data preprocessing tasks like handling missing values, scaling, and encoding.
# Initialize the PyCaret environment
exp = setup(data=train, target='Class variable', session_id=123)
Some of the important setup parameters include:
- data: the training dataset
- target: the name of the target column
- session_id: sets the random seed for reproducibility
Comparing Base Models
PyCaret allows you to compare multiple base models and select the best candidates for ensemble modeling.
# Compare models and rank them based on performance
best_models = compare_models(n_select=3)
Here’s what’s going on:
- compare_models() evaluates all available models and ranks them based on default metrics like accuracy or AUC
- n_select=3 selects the top 3 models for further use
Creating Bagging and Boosting Models
You can create a bagging ensemble using PyCaret’s create_model() function:
# Create a Random Forest model
rf_model = create_model('rf')
Boosting models can be created in a similar way:
# Create a Gradient Boosting model
gb_model = create_model('gbc')
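PyCaret also provides a generic ensemble_model() function that wraps an already-created model with bagging or boosting; a brief sketch (the decision tree base model is illustrative):

# Wrap a single decision tree with bagging or boosting
dt_model = create_model('dt')
bagged_dt = ensemble_model(dt_model, method='Bagging')
boosted_dt = ensemble_model(dt_model, method='Boosting')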
Creating a Stacking Ensemble
Stacking ensembles combine predictions from multiple models using a meta-model, and can be created with a single function call:
# Create a Stacking ensemble using top 3 models
stacked_model = stack_models(best_models)
Here, stack_models() combines the predictions from the models in best_models using a meta-model — the default is logistic regression for classification.
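If you prefer a different meta-model, stack_models() accepts a meta_model argument; the gradient boosting choice below is purely illustrative:

# Use a custom meta-model instead of the default logistic regression
stacked_gbc = stack_models(best_models, meta_model=create_model('gbc'))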
Creating a Voting Ensemble
Voting aggregates predictions by majority voting (classification) or averaging (regression).
# Create a Voting ensemble using top 3 models
voting_model = blend_models(best_models)
In the above, blend_models() automatically combines the predictions of the selected models into a single ensemble.
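By default, blend_models() picks a voting scheme automatically; you can request soft or hard voting explicitly via the method parameter:

# Soft voting averages the models' predicted probabilities
soft_voting_model = blend_models(best_models, method='soft')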
Evaluating Models
You can evaluate ensemble models using the evaluate_model() function, which provides various visualizations such as the ROC curve, precision-recall curve, and confusion matrix. Here, let's evaluate the stacked model and view the confusion matrix.
# Evaluate the stacked model
evaluate_model(stacked_model)
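evaluate_model() opens an interactive widget; to render a specific plot directly, or to score the held-out test split created earlier, you can use plot_model() and predict_model():

# Render the confusion matrix directly
plot_model(stacked_model, plot='confusion_matrix')

# Generate predictions on the held-out test set
predictions = predict_model(stacked_model, data=test)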
Best Practices for Ensemble Modeling
For the best shot at high-quality results, keep the following best practices in mind when creating your ensemble models.
- Ensure Model Diversity: Use different model types and vary hyperparameters to increase diversity
- Limit Model Complexity: Avoid overly complex models to prevent overfitting and use regularization techniques
- Monitor Ensemble Size: Avoid unnecessary models and ensure that adding more models improves performance
- Handle Class Imbalance: Address class imbalance using techniques like oversampling or weighted loss functions
- Ensemble Model Fusion: Combine different ensemble methods (e.g., stacking and bagging) for better results
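Before deployment, you will typically tune the strongest base models and persist the final ensemble. A minimal sketch using PyCaret's tune_model(), finalize_model(), and save_model() (the file name is arbitrary):

# Tune a base model's hyperparameters (can be done before ensembling)
tuned_rf = tune_model(rf_model)

# Retrain the chosen ensemble on the full dataset and save it to disk
final_model = finalize_model(stacked_model)
save_model(final_model, 'diabetes_stacked_ensemble')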
Conclusion
Ensemble models improve machine learning performance by combining multiple models, and PyCaret simplifies this process with easy-to-use functions. You can create bagging, boosting, stacking, and voting ensembles effortlessly with the library, which also supports hyperparameter tuning for better results. Evaluate your models to choose the best one, and then save your ensemble models for future use or deployment. When following best practices, ensemble learning combined with PyCaret can help you build powerful models quickly and efficiently.