Building a custom model pipeline in PyCaret can simplify your machine learning workflow. PyCaret automates many steps, including data preparation and model training, and it also lets you create and use your own custom models.
In this article, we will build a custom machine learning pipeline step by step using PyCaret.
What is PyCaret?
PyCaret is a tool that automates machine learning workflows. It handles repetitive tasks such as scaling data, encoding variables, and tuning hyperparameters. PyCaret supports many machine learning tasks, including:
- Classification (predict categories)
- Regression (predict numbers)
- Clustering (group data)
- Anomaly detection (identify outliers)
PyCaret works well with popular libraries like scikit-learn, XGBoost, and LightGBM.
Setting Up the Environment
First, install PyCaret using pip:
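pip install pycaret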
Next, import the correct module for your task:
from pycaret.classification import *  # For classification tasks
from pycaret.regression import *      # For regression tasks
Preparing the Data
Before starting a machine learning project, you need to prepare the data. PyCaret works well with Pandas, which makes loading, exploring, and cleaning datasets straightforward.
Here’s how to load and explore the Iris dataset:
from sklearn.datasets import load_iris
import pandas as pd

iris = load_iris()
data = pd.DataFrame(iris.data, columns=iris.feature_names)
data['target'] = iris.target
Ensure your data is clean and contains a target column (in our case, the 'target' column we just created from iris.target). This is the variable you want to predict.
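A quick sanity check with standard pandas calls helps confirm the data really is clean; this is a minimal sketch:

# Inspect the first rows, missing values, and class balance
print(data.head())
print(data.isnull().sum())
print(data['target'].value_counts())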
Setting Up the PyCaret Environment
PyCaret’s setup() function prepares your data for training. It handles tasks such as:
- Filling missing values: automatically replaces missing data with appropriate values
- Encoding categorical variables: converts non-numerical categories into numbers
- Scaling numerical features: normalizes data to ensure uniformity
Here’s how to set it up:
from pycaret.classification import setup

# Initialize the environment
exp1 = setup(data, target='target')
Some setup parameters worth mentioning (combined in the example after this list) include:
- preprocess: enables or disables PyCaret's built-in preprocessing (enabled by default)
- session_id: sets a fixed random seed so results are reproducible
- fold: sets the number of cross-validation folds
- fix_imbalance=True: resamples the training data to handle imbalanced datasets
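Here is an example combining these options; the values are illustrative, not recommendations for every dataset:

# Initialize the experiment with explicit options
exp1 = setup(
    data,
    target='target',
    session_id=123,        # fixed seed for reproducible runs
    fold=10,               # 10-fold cross-validation
    preprocess=True,       # keep built-in preprocessing enabled
    fix_imbalance=False    # set to True for imbalanced targets
)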
In summary, this step prepares the data and creates a foundation for training models.
Available Models
PyCaret provides a range of machine learning algorithms. You can view a list of supported models using the models() function:
# List available models
models()
This function generates a table listing each model's name and its short identifier (ID), so you can quickly see which algorithms are available and suitable for your task.
Comparing Models
The compare_models() function, one of PyCaret's most useful workflow helpers, trains multiple models and ranks them by performance. It helps identify the best model for your dataset by comparing models with metrics such as:
- Accuracy: For classification tasks
- R-squared: For regression tasks
Here’s how to use it:
# Compare models and find the best one
best_model = compare_models()

# Print the best model
print(best_model)
This compares all available models using default hyperparameters and prints the details of the top performer according to the primary performance metric. The best_model object holds the trained model with the best score.
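If you only want to evaluate specific candidates, or keep more than one winner, compare_models() also accepts parameters such as include and n_select; a short sketch, assuming a recent PyCaret release:

# Compare only a few selected models (by their IDs)
best_model = compare_models(include=['lr', 'rf', 'dt'])

# Keep the top three models instead of just the best one
top3 = compare_models(n_select=3)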
Creating the Model
After comparing models with compare_models(), you can create the best model using the create_model() function.
# Train the best model
model = create_model(best_model)
This function trains the selected model on your dataset using cross-validation and displays the fold-by-fold scores.
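create_model() also accepts the short IDs listed by models(), which is useful when you already know which algorithm you want; for example:

# Train a random forest classifier directly by its ID
rf_model = create_model('rf')

# Train a logistic regression with 5-fold cross-validation
lr_model = create_model('lr', fold=5)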
Hyperparameter Tuning
Fine-tuning your model’s parameters can significantly improve its performance. PyCaret automates this process with smart search strategies.
# Tune model with random search
tuned_model = tune_model(model, n_iter=50, optimize='Accuracy')

# Use a specific search grid
tuned_model = tune_model(model, custom_grid={
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 5, 7]
})
PyCaret automatically performs cross-validation during tuning and selects the best parameters based on your chosen metric. You can also specify custom parameter grids for more control over the tuning process.
tune_model() also supports different tuning strategies such as grid search and Bayesian optimization:
# Grid search
tuned_model = tune_model(model, search_library='scikit-learn', search_algorithm='grid')

# Bayesian optimization
tuned_model = tune_model(model, search_library='optuna')
Evaluating the Models
It’s important to evaluate a model’s performance to understand its behavior on unseen data. PyCaret’s evaluate_model() function provides a detailed, interactive review of the model’s performance.
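For example, calling it on the tuned model opens an interactive widget (in a notebook) that lets you switch between the available plots:

# Interactive evaluation dashboard for the tuned model
evaluate_model(tuned_model)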
Here are some common plots PyCaret provides for evaluating a model.
Confusion Matrix
The confusion matrix shows how well the model classifies each category in the dataset. It compares the predicted labels against the true labels. This plot helps you understand the errors in the classification.
# Plot confusion matrix
plot_model(tuned_model, plot='confusion_matrix')
ROC Curve
The ROC curve (Receiver Operating Characteristic curve) shows the trade-off between the True Positive Rate (sensitivity) and the False Positive Rate (1 – specificity) at various threshold settings. It is useful for evaluating classification models, especially when there is class imbalance.
# Plot ROC curve (PyCaret names this plot 'auc')
plot_model(tuned_model, plot='auc')
Learning Curve
The learning curve shows how the model’s performance improves as the number of training samples increases. It can help you identify if the model is underfitting or overfitting.
# Plot learning curve
plot_model(tuned_model, plot='learning')
Model Interpretation
Understanding how your model makes decisions is important for both debugging and building trust. PyCaret provides several tools for model interpretation.
# Plot feature importance
plot_model(model, plot='feature')

# Generate SHAP summary values (for tree-based models)
interpret_model(model, plot='summary')

# Create correlation analysis
interpret_model(model, plot='correlation')
These visualizations help explain which features influence your model’s predictions most strongly. For classification tasks, you can also analyze decision boundaries and confusion matrices to understand model behavior.
Saving and Loading Custom Models
After training and fine-tuning a model, you'll often want to save it for later use. PyCaret makes this process straightforward: save_model() stores the trained model together with the preprocessing pipeline created by setup(), so new data goes through exactly the same transformations at prediction time. The code below shows the full save-and-load cycle.
# Train and tune your model
model = create_model('rf')
tuned_model = tune_model(model)

# Save model (the preprocessing pipeline is included by default)
save_model(tuned_model, 'final_model')

# Load model
loaded_model = load_model('final_model')

# Use model
predictions = predict_model(loaded_model, new_data)
What’s happening:
- save_model(tuned_model, 'final_model'): saves tuned_model to the file final_model.pkl together with its preprocessing pipeline
- loaded_model = load_model('final_model'): loads the saved model back into loaded_model
- predictions = predict_model(loaded_model, new_data): uses the model, automatically applying the saved preprocessing to new_data
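As a quick illustration, new_data can be any DataFrame with the same feature columns used during setup(); this hypothetical example reuses the Iris feature names from earlier:

import pandas as pd

# A single unseen sample with the same feature columns as the training data
new_data = pd.DataFrame(
    [[5.1, 3.5, 1.4, 0.2]],
    columns=['sepal length (cm)', 'sepal width (cm)',
             'petal length (cm)', 'petal width (cm)']
)

predictions = predict_model(loaded_model, data=new_data)
print(predictions)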
Creating Production Pipelines
Moving from experimentation and model-building to production and model-deployment requires robust, reproducible pipelines. PyCaret simplifies this transition with built-in pipeline creation.
# Create the deployment pipeline: finalize_model() retrains the model,
# together with its preprocessing steps, on the entire dataset
final_pipeline = finalize_model(model)

# Custom transformers (e.g. sklearn's StandardScaler) can be added to the
# preprocessing pipeline through the custom_pipeline argument of setup()

# Export pipeline for deployment
save_model(final_pipeline, 'production_ready_model')
These pipelines ensure that all preprocessing steps, feature engineering, and model inference happen in the correct order, making deployment more reliable.
Production Deployment
Deploying models to production environments requires careful handling of both model artifacts and preprocessing steps. PyCaret provides tools to make this process seamless.
# Save complete pipeline
deployment_ready_model = save_model(final_pipeline, 'production_model')

# Example production usage
loaded_pipeline = load_model('production_model')
predictions = predict_model(loaded_pipeline, new_data)

# Monitor model performance: raw_score=True adds per-class probability columns
predictions = predict_model(loaded_pipeline, new_data, raw_score=True)
print(predictions.head())
This approach ensures consistency between training and production environments. The saved pipeline handles all necessary data transformations automatically, reducing the risk of preprocessing mismatches in production.
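Recent PyCaret releases also include helpers for packaging a saved pipeline as a service; as a sketch (function names assume PyCaret 3.x), create_api() generates a FastAPI script and create_docker() a matching Dockerfile:

# Generate a FastAPI script that serves the finalized pipeline
create_api(final_pipeline, 'iris_api')

# Generate a Dockerfile and requirements file for that API
create_docker('iris_api')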
Using a Custom Model
Creating custom models in PyCaret can be very useful in cases where:
- you want to implement a novel algorithm that isn’t available in standard libraries
- you need to modify an existing algorithm to suit your specific problem
- you want more control over the model’s behavior or performance
In PyCaret, you can create your own custom machine learning models using scikit-learn, which gives you finer control over how your model behaves. To use your custom model in PyCaret, you need to extend two classes from scikit-learn:
- BaseEstimator: provides the standard scikit-learn plumbing (such as get_params() and set_params()) that PyCaret relies on to clone and tune your model
- ClassifierMixin: marks the estimator as a classifier and adds classification helpers, such as the default score() method
To demonstrate how to create a custom model, let’s walk through an implementation of a weighted K-Nearest Neighbors (KNN) classifier.
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.neighbors import NearestNeighbors
from sklearn.utils.validation import check_X_y, check_array, check_is_fitted
from sklearn.utils.multiclass import unique_labels
import numpy as np

class WeightedKNN(BaseEstimator, ClassifierMixin):
    def __init__(self, n_neighbors=5):
        self.n_neighbors = n_neighbors

    def fit(self, X, y):
        X, y = check_X_y(X, y)
        self.classes_ = unique_labels(y)
        self.nn_ = NearestNeighbors(n_neighbors=self.n_neighbors).fit(X)
        self.y_ = y
        return self

    def predict_proba(self, X):
        check_is_fitted(self)
        X = check_array(X)
        distances, indices = self.nn_.kneighbors(X)

        # Weight each neighbor by the inverse of its distance
        weights = 1 / (distances + np.finfo(float).eps)
        weights /= np.sum(weights, axis=1)[:, np.newaxis]

        # Accumulate the neighbor weights per class for every sample
        proba = np.zeros((X.shape[0], len(self.classes_)))
        for i in range(X.shape[0]):
            for j in range(self.n_neighbors):
                class_idx = np.where(self.classes_ == self.y_[indices[i, j]])[0][0]
                proba[i, class_idx] += weights[i, j]
        return proba

    def predict(self, X):
        return self.classes_[np.argmax(self.predict_proba(X), axis=1)]
After you’ve created your custom model, you can easily integrate it with PyCaret using the create_model() function. This function will allow PyCaret to handle the custom model just as it would any built-in model.
custom_knn = create_model(WeightedKNN(n_neighbors=3))
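Because PyCaret cannot infer the hyperparameter space of a custom estimator, tuning it generally requires an explicit custom_grid; a minimal sketch, assuming the WeightedKNN class above:

# Tune the custom model over an explicit search grid
tuned_custom_knn = tune_model(custom_knn, custom_grid={'n_neighbors': [3, 5, 7, 9]})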
Conclusion
Creating a custom model pipeline in PyCaret can help make your entire machine learning workflow much easier to implement. PyCaret can help with data prep, building models, and evaluating them. You can even add your own custom models and use PyCaret’s tools to improve them. After tuning and testing, models can be saved and used in production.