Building a custom model pipeline in PyCaret can simplify your machine learning workflow. PyCaret automates many steps, including data preparation and model training, and it also lets you create and use your own custom models.
In this article, we will build a custom machine learning pipeline step by step using PyCaret.
What is PyCaret?
PyCaret is a tool that automates machine learning workflows. It handles repetitive tasks such as scaling data, encoding variables, and tuning hyperparameters. PyCaret supports many machine learning tasks, including:
- Classification (predict categories)
- Regression (predict numbers)
- Clustering (group data)
- Anomaly detection (identify outliers)
PyCaret works well with popular libraries like scikit-learn, XGBoost, and LightGBM.
Setting Up the Environment
First, install PyCaret using pip:
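pip install pycaret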
Next, import the correct module for your task:
from pycaret.classification import *  # For classification tasks
from pycaret.regression import *      # For regression tasks
Preparing the Data
Before starting a machine learning project, you need to prepare the data. PyCaret works well with Pandas, which makes loading, exploring, and cleaning datasets straightforward.
Here’s how to load and explore the Iris dataset:
from sklearn.datasets import load_iris
import pandas as pd

iris = load_iris()
data = pd.DataFrame(iris.data, columns=iris.feature_names)
data['target'] = iris.target
Ensure your data is clean and contains a target column (in our case, the 'target' column we just created from iris.target). This is the variable you want to predict.
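A quick sanity check with standard pandas calls helps confirm the data really is clean; this is a minimal sketch:

# Inspect the first rows, missing values, and class balance
print(data.head())
print(data.isnull().sum())
print(data['target'].value_counts())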
Setting Up the PyCaret Environment
PyCaret’s setup() function prepares your data for training. It handles tasks such as:
- Filling missing values: automatically replaces missing data with appropriate values
- Encoding categorical variables: converts non-numerical categories into numbers
- Scaling numerical features: normalizes data to ensure uniformity
Here’s how to set it up:
from pycaret.classification import setup

# Initialize the environment
exp1 = setup(data, target='target')
Some setup parameters worth mentioning (combined in the example after this list) include:
- preprocess: enables or disables PyCaret's built-in preprocessing (enabled by default)
- session_id: sets a fixed random seed so results are reproducible
- fold: sets the number of cross-validation folds
- fix_imbalance=True: resamples the training data to handle imbalanced datasets
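Here is an example combining these options; the values are illustrative, not recommendations for every dataset:

# Initialize the experiment with explicit options
exp1 = setup(
    data,
    target='target',
    session_id=123,        # fixed seed for reproducible runs
    fold=10,               # 10-fold cross-validation
    preprocess=True,       # keep built-in preprocessing enabled
    fix_imbalance=False    # set to True for imbalanced targets
)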
In summary, this step prepares the data and creates a foundation for training models.
Available Models
PyCaret provides a range of machine learning algorithms. You can view a list of supported models using the models() function:
# List available models
models()
This function generates a table listing each model's name and its short identifier (ID), so you can quickly see which algorithms are available and suitable for your task.
Comparing Models
The compare_models() function, one of PyCaret's most useful workflow helpers, trains multiple models and ranks them by performance. It helps identify the best model for your dataset by comparing models with metrics such as:
- Accuracy: For classification tasks
- R-squared: For regression tasks
Here’s how to use it:
# Compare models and find the best one
best_model = compare_models()

# Print the best model
print(best_model)
This compares all available models using default hyperparameters and prints the details of the top performer according to the primary performance metric. The best_model object holds the trained model with the best score.
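If you only want to evaluate specific candidates, or keep more than one winner, compare_models() also accepts parameters such as include and n_select; a short sketch, assuming a recent PyCaret release:

# Compare only a few selected models (by their IDs)
best_model = compare_models(include=['lr', 'rf', 'dt'])

# Keep the top three models instead of just the best one
top3 = compare_models(n_select=3)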
Creating the Model
After comparing models with compare_models(), you can create the best model using the create_model() function.
# Train the best model
model = create_model(best_model)
This function trains the selected model on your dataset using cross-validation and displays the fold-by-fold scores.
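create_model() also accepts the short IDs listed by models(), which is useful when you already know which algorithm you want; for example:

# Train a random forest classifier directly by its ID
rf_model = create_model('rf')

# Train a logistic regression with 5-fold cross-validation
lr_model = create_model('lr', fold=5)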
Hyperparameter Tuning
Fine-tuning your model’s parameters can significantly improve its performance. PyCaret automates this process with smart search strategies.
# Tune model with random search
tuned_model = tune_model(model, n_iter=50, optimize='Accuracy')

# Use a specific search grid
tuned_model = tune_model(model, custom_grid={
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 5, 7]
})
PyCaret automatically performs cross-validation during tuning and selects the best parameters based on your chosen metric. You can also specify custom parameter grids for more control over the tuning process.
tune_model() also supports different tuning strategies such as grid search and Bayesian optimization:
# Grid search
tuned_model = tune_model(model, search_library='scikit-learn', search_algorithm='grid')

# Bayesian optimization
tuned_model = tune_model(model, search_library='optuna')
Evaluating the Models
It’s important to evaluate a model’s performance to understand its behavior on unseen data. PyCaret’s evaluate_model() function provides a detailed, interactive review of the model’s performance.
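For example, calling it on the tuned model opens an interactive widget (in a notebook) that lets you switch between the available plots:

# Interactive evaluation dashboard for the tuned model
evaluate_model(tuned_model)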
Here are some common plots PyCaret provides for evaluating a model.
Confusion Matrix
The confusion matrix shows how well the model classifies each category in the dataset. It compares the predicted labels against the true labels. This plot helps you understand the errors in the classification.
# Plot confusion matrix
plot_model(tuned_model, plot='confusion_matrix')
ROC Curve
The ROC curve (Receiver Operating Characteristic curve) shows the trade-off between the True Positive Rate (sensitivity) and the False Positive Rate (1 – specificity) at various threshold settings. It is useful for evaluating classification models, especially when there is class imbalance.
# Plot ROC curve (PyCaret names this plot 'auc')
plot_model(tuned_model, plot='auc')
Learning Curve
The learning curve shows how the model’s performance improves as the number of training samples increases. It can help you identify if the model is underfitting or overfitting.
# Plot learning curve
plot_model(tuned_model, plot='learning')
Model Interpretation
Understanding how your model makes decisions is important for both debugging and building trust. PyCaret provides several tools for model interpretation.
# Plot feature importance
plot_model(model, plot='feature')

# Generate SHAP summary values (for tree-based models)
interpret_model(model, plot='summary')

# Create correlation analysis
interpret_model(model, plot='correlation')
These visualizations help explain which features influence your model’s predictions most strongly. For classification tasks, you can also analyze decision boundaries and confusion matrices to understand model behavior.
Saving and Loading Custom Models
After training and fine-tuning a model, you'll often want to save it for later use. PyCaret makes this process straightforward: save_model() stores the trained model together with the preprocessing pipeline created by setup(), so new data goes through exactly the same transformations at prediction time. The code below shows the full save-and-load cycle.
# Train and tune your model
model = create_model('rf')
tuned_model = tune_model(model)

# Save model (the preprocessing pipeline is included by default)
save_model(tuned_model, 'final_model')

# Load model
loaded_model = load_model('final_model')

# Use model
predictions = predict_model(loaded_model, new_data)
What’s happening:
- save_model(tuned_model, 'final_model'): saves tuned_model to the file final_model.pkl together with its preprocessing pipeline
- loaded_model = load_model('final_model'): loads the saved model back into loaded_model
- predictions = predict_model(loaded_model, new_data): uses the model, automatically applying the saved preprocessing to new_data
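As a quick illustration, new_data can be any DataFrame with the same feature columns used during setup(); this hypothetical example reuses the Iris feature names from earlier:

import pandas as pd

# A single unseen sample with the same feature columns as the training data
new_data = pd.DataFrame(
    [[5.1, 3.5, 1.4, 0.2]],
    columns=['sepal length (cm)', 'sepal width (cm)',
             'petal length (cm)', 'petal width (cm)']
)

predictions = predict_model(loaded_model, data=new_data)
print(predictions)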
Creating Production Pipelines
Moving from experimentation and model-building to production and model-deployment requires robust, reproducible pipelines. PyCaret simplifies this transition with built-in pipeline creation.
# Create the deployment pipeline: finalize_model() retrains the model,
# together with its preprocessing steps, on the entire dataset
final_pipeline = finalize_model(model)

# Custom transformers (e.g. sklearn's StandardScaler) can be added to the
# preprocessing pipeline through the custom_pipeline argument of setup()

# Export pipeline for deployment
save_model(final_pipeline, 'production_ready_model')
These pipelines ensure that all preprocessing steps, feature engineering, and model inference happen in the correct order, making deployment more reliable.
Production Deployment
Deploying models to production environments requires careful handling of both model artifacts and preprocessing steps. PyCaret provides tools to make this process seamless.
# Save complete pipeline
deployment_ready_model = save_model(final_pipeline, 'production_model')

# Example production usage
loaded_pipeline = load_model('production_model')
predictions = predict_model(loaded_pipeline, new_data)

# Monitor model performance: raw_score=True adds per-class probability columns
predictions = predict_model(loaded_pipeline, new_data, raw_score=True)
print(predictions.head())
This approach ensures consistency between training and production environments. The saved pipeline handles all necessary data transformations automatically, reducing the risk of preprocessing mismatches in production.
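Recent PyCaret releases also include helpers for packaging a saved pipeline as a service; as a sketch (function names assume PyCaret 3.x), create_api() generates a FastAPI script and create_docker() a matching Dockerfile:

# Generate a FastAPI script that serves the finalized pipeline
create_api(final_pipeline, 'iris_api')

# Generate a Dockerfile and requirements file for that API
create_docker('iris_api')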
Using a Custom Model
Creating custom models in PyCaret can be very useful in cases where:
- you want to implement a novel algorithm that isn’t available in standard libraries
- you need to modify an existing algorithm to suit your specific problem
- you want more control over the model’s behavior or performance
In PyCaret, you can create your own custom machine learning models using scikit-learn, which gives you finer control over how your model behaves. To use your custom model in PyCaret, you need to extend two classes from scikit-learn:
- BaseEstimator: provides the standard scikit-learn plumbing (such as get_params() and set_params()) that PyCaret relies on to clone and tune your model
- ClassifierMixin: marks the estimator as a classifier and adds classification helpers, such as the default score() method
To demonstrate how to create a custom model, let’s walk through an implementation of a weighted K-Nearest Neighbors (KNN) classifier.
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.neighbors import NearestNeighbors
from sklearn.utils.validation import check_X_y, check_array, check_is_fitted
from sklearn.utils.multiclass import unique_labels
import numpy as np

class WeightedKNN(BaseEstimator, ClassifierMixin):
    def __init__(self, n_neighbors=5):
        self.n_neighbors = n_neighbors

    def fit(self, X, y):
        X, y = check_X_y(X, y)
        self.classes_ = unique_labels(y)
        self.nn_ = NearestNeighbors(n_neighbors=self.n_neighbors).fit(X)
        self.y_ = y
        return self

    def predict_proba(self, X):
        check_is_fitted(self)
        X = check_array(X)
        distances, indices = self.nn_.kneighbors(X)

        # Weight each neighbor by the inverse of its distance
        weights = 1 / (distances + np.finfo(float).eps)
        weights /= np.sum(weights, axis=1)[:, np.newaxis]

        # Accumulate the neighbor weights per class for every sample
        proba = np.zeros((X.shape[0], len(self.classes_)))
        for i in range(X.shape[0]):
            for j in range(self.n_neighbors):
                class_idx = np.where(self.classes_ == self.y_[indices[i, j]])[0][0]
                proba[i, class_idx] += weights[i, j]
        return proba

    def predict(self, X):
        return self.classes_[np.argmax(self.predict_proba(X), axis=1)]
After you’ve created your custom model, you can easily integrate it with PyCaret using the create_model() function. This function will allow PyCaret to handle the custom model just as it would any built-in model.
custom_knn = create_model(WeightedKNN(n_neighbors=3))
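Because PyCaret cannot infer the hyperparameter space of a custom estimator, tuning it generally requires an explicit custom_grid; a minimal sketch, assuming the WeightedKNN class above:

# Tune the custom model over an explicit search grid
tuned_custom_knn = tune_model(custom_knn, custom_grid={'n_neighbors': [3, 5, 7, 9]})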
Conclusion
Creating a custom model pipeline in PyCaret can help make your entire machine learning workflow much easier to implement. PyCaret can help with data prep, building models, and evaluating them. You can even add your own custom models and use PyCaret’s tools to improve them. After tuning and testing, models can be saved and used in production.