The Power of Pipelines


Machine learning projects often require the execution of a sequence of data preprocessing steps followed by a learning algorithm. Managing these steps individually can be cumbersome and error-prone. This is where sklearn pipelines come into play. This post will explore how pipelines automate critical aspects of machine learning workflows, such as data preprocessing, feature engineering, and the incorporation of machine learning algorithms.

Let’s get started.

The Power of Pipelines
Photo by Quinten de Graaf. Some rights reserved.

Overview

This post is divided into three parts; they are:

  • What is a Pipeline?
  • Elevating Our Model with Advanced Transformations
  • Handling Missing Data with Imputation in Pipelines

What is a Pipeline?

A pipeline automates and encapsulates a sequence of transformation steps together with a final estimator in a single object. By defining a pipeline, you ensure that the same sequence of steps is applied to both the training and the testing data, enhancing the reproducibility and reliability of your model.

Let’s demonstrate the implementation of a pipeline and compare it with a traditional approach without a pipeline. Consider a simple scenario where we want to predict house prices based on the quality of a house, using the ‘OverallQual’ feature from the Ames Housing dataset. Here’s a side-by-side comparison of performing 5-fold cross-validation with and without using a pipeline:
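Below is a minimal sketch of this comparison. It assumes the Ames data is available locally as 'Ames.csv' with 'OverallQual' and 'SalePrice' columns; the file name is an assumption, so adjust it to wherever your copy of the dataset lives.

```python
# Minimal sketch: 5-fold cross-validation with and without a pipeline.
# Assumes "Ames.csv" is available locally with "OverallQual" and "SalePrice".
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

Ames = pd.read_csv("Ames.csv")
X = Ames[["OverallQual"]]   # single predictive feature
y = Ames["SalePrice"]       # target

# Without a pipeline: pass the estimator directly to cross_val_score
scores_no_pipeline = cross_val_score(LinearRegression(), X, y, cv=5)

# With a pipeline: wrap the same estimator in a one-step Pipeline
pipeline = Pipeline([("regressor", LinearRegression())])
scores_pipeline = cross_val_score(pipeline, X, y, cv=5)

print("Without pipeline:", scores_no_pipeline.mean())
print("With pipeline:   ", scores_pipeline.mean())
```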

Both methods yield exactly the same results.

Here is a visual to illustrate this basic pipeline.

This example uses a straightforward case with only one feature. Still, as models grow more complex, pipelines can manage multiple preprocessing steps, such as scaling, encoding, and dimensionality reduction, before applying the model.

Building on our foundational understanding of sklearn pipelines, let’s expand our scenario to include feature engineering — an essential step in improving model performance. Feature engineering involves creating new features from the existing data that might have a stronger relationship with the target variable. In our case, we suspect that the interaction between the quality of a house and its living area could be a better predictor of the house price than either feature alone. Here’s another side-by-side comparison of performing 5-fold cross-validation with and without using a pipeline:
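The sketch below illustrates this comparison. It assumes the living area is stored in a 'GrLivArea' column (an assumption about the dataset) and builds the 'Quality Weighted Area' feature as the product of quality and living area:

```python
# Sketch: feature engineering with and without a pipeline.
# Assumes "Ames.csv" with "OverallQual", "GrLivArea", and "SalePrice" columns.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

Ames = pd.read_csv("Ames.csv")
y = Ames["SalePrice"]

# Without a pipeline: engineer the feature manually before cross-validation
X_manual = (Ames["OverallQual"] * Ames["GrLivArea"]).to_frame("Quality Weighted Area")
scores_manual = cross_val_score(LinearRegression(), X_manual, y, cv=5)

# With a pipeline: a FunctionTransformer builds the feature inside each fold
def quality_weighted_area(df):
    """Return a single-column frame holding the interaction feature."""
    return (df["OverallQual"] * df["GrLivArea"]).to_frame("Quality Weighted Area")

pipeline = Pipeline([
    ("create_feature", FunctionTransformer(quality_weighted_area)),
    ("regressor", LinearRegression()),
])
scores_pipeline = cross_val_score(pipeline, Ames[["OverallQual", "GrLivArea"]], y, cv=5)

print("Without pipeline:", scores_manual.mean())
print("With pipeline:   ", scores_pipeline.mean())
```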

Both methods produce the same results again.

This confirms that, by using a pipeline, we encapsulate feature engineering within the model training process, making it an integral part of cross-validation. With pipelines, each cross-validation fold generates the ‘Quality Weighted Area’ feature within the pipeline, ensuring that our feature engineering step is validated correctly, avoiding data leakage and thus producing a more reliable estimate of model performance.

Here is a visual to illustrate how we used the FunctionTransformer as part of our preprocessing step in this pipeline.

The pipelines above ensure that our feature engineering and preprocessing efforts are accurately reflected in the model’s performance metrics. As we continue, we’ll venture into more advanced territory, showcasing the robustness of pipelines when dealing with various preprocessing tasks and different types of variables.

Elevating Our Model with Advanced Transformations

Our next example incorporates a cubic transformation, an engineered feature, categorical encoding, and raw features included without any transformation. It exemplifies how a pipeline can handle a mix of data types and transformations, streamlining the preprocessing and modeling steps into a cohesive process.

Feature engineering is an art that often requires a creative touch. By applying a cubic transformation to the ‘OverallQual’ feature, we hypothesize that the non-linear relationship between quality and price could be better captured. Additionally, we engineer a ‘QualityArea’ feature, which we believe might interact more significantly with the sale price than either feature alone. We also handle the categorical features ‘Neighborhood’, ‘ExterQual’, and ‘KitchenQual’ with one-hot encoding, a crucial step in preparing categorical data for a linear model. Finally, we pass ‘YearBuilt’ directly into the model so that its valuable temporal information is not transformed unnecessarily.
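The sketch below shows one way to assemble this with a ColumnTransformer. The 'GrLivArea' column used to build 'QualityArea' is an assumption about the dataset, so the score it prints may differ slightly from the figure quoted next:

```python
# Sketch: cubic transform + engineered feature + one-hot encoding + passthrough,
# all combined in a ColumnTransformer inside a single pipeline.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder

Ames = pd.read_csv("Ames.csv")
y = Ames["SalePrice"]
X = Ames[["OverallQual", "GrLivArea", "Neighborhood",
          "ExterQual", "KitchenQual", "YearBuilt"]]

# Cube 'OverallQual' to capture a non-linear quality effect
cubic_quality = FunctionTransformer(lambda df: df ** 3)

# Build the 'QualityArea' interaction feature (quality x living area)
quality_area = FunctionTransformer(
    lambda df: (df["OverallQual"] * df["GrLivArea"]).to_frame("QualityArea")
)

preprocessor = ColumnTransformer(transformers=[
    ("cubic_quality", cubic_quality, ["OverallQual"]),
    ("quality_area", quality_area, ["OverallQual", "GrLivArea"]),
    ("categorical", OneHotEncoder(handle_unknown="ignore"),
     ["Neighborhood", "ExterQual", "KitchenQual"]),
    ("year_built", "passthrough", ["YearBuilt"]),
])

pipeline = Pipeline([
    ("preprocessing", preprocessor),
    ("regressor", LinearRegression()),
])

scores = cross_val_score(pipeline, X, y, cv=5)
print("Mean CV R^2:", scores.mean())
```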

With an impressive mean CV R² score of 0.850, this pipeline demonstrates the substantial impact of thoughtful feature engineering and preprocessing on model performance. It highlights the efficiency and scalability of pipelines and underscores their strategic importance in building robust predictive models. Here is a visual to illustrate this pipeline.

The true advantage of this methodology lies in its unified workflow. By elegantly combining feature engineering, transformations, and model evaluation into a single, coherent process, pipelines greatly enhance the accuracy and validity of our predictive models. This advanced example reinforces the concept that, with pipelines, complexity does not come at the cost of clarity or performance in machine learning workflows.

Handling Missing Data with Imputation in Pipelines

The reality of most datasets, especially large ones, is that they often contain missing values. Neglecting to handle these missing values can lead to significant biases or errors in your predictive models. In this section, we will demonstrate how to seamlessly integrate data imputation into our pipeline to ensure that our linear regression model is robust against such issues.

In a previous post, we delved into the depths of missing data, manually imputing missing values in the Ames dataset without using pipelines. Building on that foundation, we now introduce how to streamline and automate imputation within our pipeline framework, providing a more efficient and less error-prone approach that is suitable even for those new to the concept.

We have chosen to use a SimpleImputer to handle the missing values for the ‘BsmtQual’ (Basement Quality) feature, a categorical variable in our dataset. The SimpleImputer will replace missing values with the constant ‘None’, indicating the absence of a basement. Post-imputation, we employ a OneHotEncoder to convert this categorical data into a numerical format suitable for our linear model. By nesting this imputation within our pipeline, we ensure that the imputation strategy is correctly applied during both the training and testing phases, thus preventing any data leakage and maintaining the integrity of our model evaluation through cross-validation.

Here’s how we integrate this into our pipeline setup:
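Below is one way to sketch this, nesting a SimpleImputer and a OneHotEncoder for 'BsmtQual' inside the ColumnTransformer from the previous example (the other column choices are carried over from that sketch and remain assumptions):

```python
# Sketch: imputing 'BsmtQual' with the constant "None" before encoding it,
# nested inside the preprocessing of the previous pipeline.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder

Ames = pd.read_csv("Ames.csv")
y = Ames["SalePrice"]
X = Ames[["OverallQual", "GrLivArea", "Neighborhood", "ExterQual",
          "KitchenQual", "YearBuilt", "BsmtQual"]]

# Missing 'BsmtQual' means no basement: impute the constant "None", then encode
bsmt_qual_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="constant", fill_value="None")),
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])

preprocessor = ColumnTransformer(transformers=[
    ("cubic_quality", FunctionTransformer(lambda df: df ** 3), ["OverallQual"]),
    ("quality_area",
     FunctionTransformer(
         lambda df: (df["OverallQual"] * df["GrLivArea"]).to_frame("QualityArea")),
     ["OverallQual", "GrLivArea"]),
    ("categorical", OneHotEncoder(handle_unknown="ignore"),
     ["Neighborhood", "ExterQual", "KitchenQual"]),
    ("bsmt_qual", bsmt_qual_transformer, ["BsmtQual"]),
    ("year_built", "passthrough", ["YearBuilt"]),
])

pipeline = Pipeline([
    ("preprocessing", preprocessor),
    ("regressor", LinearRegression()),
])

print("Mean CV R^2:", cross_val_score(pipeline, X, y, cv=5).mean())
```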

The use of SimpleImputer in our pipeline helps efficiently handle missing data. When coupled with the rest of the preprocessing steps and the linear regression model, the complete setup allows us to evaluate the true impact of our preprocessing choices on model performance.

Here is a visual of our pipeline, which includes missing data imputation.

This integration showcases the flexibility of sklearn pipelines and emphasizes how essential preprocessing steps, like imputation, are seamlessly included in the machine learning workflow, enhancing the model’s reliability and accuracy.

Further Reading

APIs

Tutorials

Ames Housing Dataset & Data Dictionary

Summary

In this post, we explored the utilization of sklearn pipelines, culminating in the sophisticated integration of data imputation for handling missing values within a linear regression context. We illustrated the seamless automation of data preprocessing steps, feature engineering, and the inclusion of advanced transformations to refine our model’s performance. The methodology highlighted in this post is not only about maintaining the workflow’s efficiency but also about ensuring the consistency and accuracy of the predictive models we aspire to build.

Specifically, you learned:

  • The foundational concept of sklearn pipelines and how they encapsulate a sequence of data transformations and a final estimator.
  • How feature engineering, when integrated into pipelines, can enhance model performance by creating new, more predictive features.
  • The strategic use of SimpleImputer within pipelines to handle missing data effectively, preventing data leakage and improving model reliability.

Do you have any questions? Please ask your questions in the comments below, and I will do my best to answer.

