Machine learning projects often require the execution of a sequence of data preprocessing steps followed by a learning algorithm. Managing these steps individually can be cumbersome and error-prone. This is where sklearn pipelines come into play. This post will explore how pipelines automate critical aspects of machine learning workflows, such as data preprocessing, feature engineering, and the incorporation of machine learning algorithms.
Let’s get started.
Overview
This post is divided into three parts; they are:
- What is a Pipeline?
- Elevating Our Model with Advanced Transformations
- Handling Missing Data with Imputation in Pipelines
What is a Pipeline?
A pipeline automates and encapsulates a sequence of transformation steps together with a final estimator in a single object. By defining a pipeline, you ensure that the same sequence of steps is applied to both the training and the testing data, enhancing the reproducibility and reliability of your model.
Let’s demonstrate the implementation of a pipeline and compare it with a traditional approach without a pipeline. Consider a simple scenario where we want to predict house prices based on the quality of a house, using the ‘OverallQual’ feature from the Ames Housing dataset. Here’s a side-by-side comparison of performing 5-fold cross-validation with and without using a pipeline:
```python
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Prepare data and setup for linear regression
Ames = pd.read_csv('Ames.csv')
y = Ames['SalePrice']
linear_model = LinearRegression()

# Perform 5-fold cross-validation without Pipeline
cv_score = cross_val_score(linear_model, Ames[['OverallQual']], y, cv=5).mean()
print("Example Without Pipeline, Mean CV R² score for 'OverallQual': {:.3f}".format(cv_score))

# Perform 5-fold cross-validation WITH Pipeline
pipeline = Pipeline([('regressor', linear_model)])
pipeline_score = cross_val_score(pipeline, Ames[['OverallQual']], y, cv=5).mean()
print("Example With Pipeline, Mean CV R² for 'OverallQual': {:.3f}".format(pipeline_score))
```
Both methods yield exactly the same results:
```
Example Without Pipeline, Mean CV R² score for 'OverallQual': 0.618
Example With Pipeline, Mean CV R² for 'OverallQual': 0.618
```
Here is a visual to illustrate this basic pipeline.
This example uses a straightforward case with only one feature. Still, as models grow more complex, pipelines can manage multiple preprocessing steps, such as scaling, encoding, and dimensionality reduction, before applying the model.
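As a quick sketch of what that can look like, here is a minimal multi-step pipeline. The scaler, the PCA step, and the component count are illustrative choices, not part of the Ames workflow that follows:

```python
# A hypothetical multi-step pipeline: scale, reduce dimensionality, then fit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

multi_step = Pipeline([
    ('scaler', StandardScaler()),      # standardize features to zero mean, unit variance
    ('pca', PCA(n_components=5)),      # compress to 5 principal components
    ('regressor', LinearRegression())  # fit the model on the transformed features
])
# multi_step can be passed to cross_val_score exactly like a bare estimator
```

Each step is a (name, transformer) pair, and the final step is the estimator; the pipeline as a whole behaves like a single estimator.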
Building on our foundational understanding of sklearn pipelines, let’s expand our scenario to include feature engineering — an essential step in improving model performance. Feature engineering involves creating new features from the existing data that might have a stronger relationship with the target variable. In our case, we suspect that the interaction between the quality of a house and its living area could be a better predictor of the house price than either feature alone. Here’s another side-by-side comparison of performing 5-fold cross-validation with and without using a pipeline:
```python
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

# Prepare data and setup for linear regression
Ames = pd.read_csv('Ames.csv')
y = Ames['SalePrice']
linear_model = LinearRegression()

# Perform 5-fold cross-validation without Pipeline
Ames['OWA'] = Ames['OverallQual'] * Ames['GrLivArea']
cv_score_2 = cross_val_score(linear_model, Ames[['OWA']], y, cv=5).mean()
print("Example Without Pipeline, Mean CV R² score for 'Quality Weighted Area': {:.3f}".format(cv_score_2))

# WITH Pipeline
# Define the transformation function for 'QualityArea'
def create_quality_area(X):
    X = X.copy()  # work on a copy so the caller's DataFrame is not mutated
    X['QualityArea'] = X['OverallQual'] * X['GrLivArea']
    return X[['QualityArea']].values

# Setup the FunctionTransformer using the function
quality_area_transformer = FunctionTransformer(create_quality_area)

# Pipeline using the engineered feature 'QualityArea'
pipeline_2 = Pipeline([
    ('quality_area_transform', quality_area_transformer),
    ('regressor', linear_model)
])
pipeline_score_2 = cross_val_score(pipeline_2, Ames[['OverallQual', 'GrLivArea']], y, cv=5).mean()

# Output the mean CV score rounded to three decimal places
print("Example With Pipeline, Mean CV R² score for 'Quality Weighted Area': {:.3f}".format(pipeline_score_2))
```
Both methods produce the same results again:
```
Example Without Pipeline, Mean CV R² score for 'Quality Weighted Area': 0.748
Example With Pipeline, Mean CV R² score for 'Quality Weighted Area': 0.748
```
This output shows that, by using a pipeline, we encapsulate feature engineering within the model training process, making it an integral part of cross-validation. With the pipeline, each cross-validation fold generates the ‘Quality Weighted Area’ feature inside the pipeline itself, ensuring that the feature engineering step is validated correctly, avoiding data leakage, and producing a more reliable estimate of model performance.
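To make the leakage point concrete, here is a minimal sketch reusing `Ames`, `y`, and the imports from the example above, with StandardScaler as a stand-in transformer. For a stateless, row-wise feature such as ‘Quality Weighted Area’ the two routes happen to give identical scores, but for any transformer that learns statistics from the data (scalers, imputers, encoders), only the pipeline route is leak-free:

```python
from sklearn.preprocessing import StandardScaler

X = Ames[['OverallQual', 'GrLivArea']]

# Leaky: the scaler learns its mean/variance from ALL rows,
# including rows that later serve as validation folds
X_scaled = StandardScaler().fit_transform(X)
leaky_score = cross_val_score(LinearRegression(), X_scaled, y, cv=5).mean()

# Leak-free: inside a pipeline, the scaler is refit on only
# the training portion of each fold
safe_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', LinearRegression())
])
safe_score = cross_val_score(safe_pipeline, X, y, cv=5).mean()
```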
Here is a visual to illustrate how we used the FunctionTransformer as part of our preprocessing step in this pipeline.
The pipelines above ensure that our feature engineering and preprocessing efforts accurately reflect the model’s performance metrics. As we continue, we’ll venture into more advanced territory, showcasing the robustness of pipelines when dealing with various preprocessing tasks and different types of variables.
Elevating Our Model with Advanced Transformations
Our next example incorporates a cubic transformation, engineered features, and categorical encoding and includes raw features without any transformation. This exemplifies how a pipeline can handle a mix of data types and transformations, streamlining the preprocessing and modeling steps into a cohesive process.
```python
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder

# Prepare data and setup for linear regression
Ames = pd.read_csv('Ames.csv')
y = Ames['SalePrice']
linear_model = LinearRegression()

# Function to apply cubic transformation
def cubic_transformation(x):
    return x ** 3

# Function to create 'QualityArea'
def create_quality_area(X):
    X = X.copy()  # work on a copy so the caller's DataFrame is not mutated
    X['QualityArea'] = X['OverallQual'] * X['GrLivArea']
    return X[['QualityArea']].values

# Setup the FunctionTransformer for cubic and quality area transformations
cubic_transformer = FunctionTransformer(cubic_transformation)
quality_area_transformer = FunctionTransformer(create_quality_area)

# Setup ColumnTransformer for preprocessing
preprocessor = ColumnTransformer(
    transformers=[
        ('cubic', cubic_transformer, ['OverallQual']),
        ('quality_area_transform', quality_area_transformer, ['OverallQual', 'GrLivArea']),
        ('onehot', OneHotEncoder(drop='first', handle_unknown='ignore'), ['Neighborhood', 'ExterQual', 'KitchenQual']),
        ('passthrough', 'passthrough', ['YearBuilt'])
    ])

# Create the pipeline with the preprocessor and linear regression
pipeline_3 = Pipeline([
    ('preprocessor', preprocessor),
    ('regressor', linear_model)
])

# Evaluate the pipeline using 5-fold cross-validation
pipeline_score_3 = cross_val_score(pipeline_3, Ames, y, cv=5).mean()

# Output the mean CV score rounded to three decimal places
print("Mean CV R² score with enhanced transformations: {:.3f}".format(pipeline_score_3))
```
Feature engineering is an art that often requires a creative touch. By applying a cubic transformation to the ‘OverallQual’ feature, we hypothesize that the non-linear relationship between quality and price could be better captured. Additionally, we engineer a ‘QualityArea’ feature, which we believe might interact more significantly with the sale price than the individual features alone. We also handle the categorical features ‘Neighborhood’, ‘ExterQual’, and ‘KitchenQual’ with one-hot encoding, a crucial step in preparing categorical data for a linear model. Finally, we pass ‘YearBuilt’ through untransformed so that its valuable temporal information reaches the model unchanged. The above pipeline yields the following:
```
Mean CV R² score with enhanced transformations: 0.850
```
With an impressive mean CV R² score of 0.850, this pipeline demonstrates the substantial impact of thoughtful feature engineering and preprocessing on model performance. It highlights pipeline efficiency and scalability and underscores their strategic importance in building robust predictive models. Here is a visual to illustrate this pipeline.
The true advantage of this methodology lies in its unified workflow. By elegantly combining feature engineering, transformations, and model evaluation into a single, coherent process, pipelines greatly enhance the accuracy and validity of our predictive models. This advanced example reinforces the concept that, with pipelines, complexity does not come at the cost of clarity or performance in machine learning workflows.
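As a usage sketch of that unified workflow, the entire preprocessing-plus-model object can be fit once and reused for prediction on raw, untransformed rows. This assumes `pipeline_3` from the code above; the split fraction and random seed are arbitrary illustrative choices:

```python
from sklearn.model_selection import train_test_split

# Split the raw DataFrame; every transformation happens inside the pipeline
X_train, X_test, y_train, y_test = train_test_split(Ames, y, test_size=0.2, random_state=42)

pipeline_3.fit(X_train, y_train)    # fits each transformer, then the regressor
preds = pipeline_3.predict(X_test)  # re-applies the same fitted transformations
print("Hold-out R² score: {:.3f}".format(pipeline_3.score(X_test, y_test)))
```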
Handling Missing Data with Imputation in Pipelines
The reality of most datasets, especially large ones, is that they often contain missing values. Neglecting to handle these missing values can lead to significant biases or errors in your predictive models. In this section, we will demonstrate how to seamlessly integrate data imputation into our pipeline to ensure that our linear regression model is robust against such issues.
In a previous post, we delved into the depths of missing data, manually imputing missing values in the Ames dataset without using pipelines. Building on that foundation, we now introduce how to streamline and automate imputation within our pipeline framework, providing a more efficient and error-proof approach suitable even for those new to the concept.
We have chosen to use a SimpleImputer to handle the missing values for the ‘BsmtQual’ (Basement Quality) feature, a categorical variable in our dataset. The SimpleImputer will replace missing values with the constant ‘None’, indicating the absence of a basement. Post-imputation, we employ a OneHotEncoder to convert this categorical data into a numerical format suitable for our linear model. By nesting this imputation within our pipeline, we ensure that the imputation strategy is correctly applied during both the training and testing phases, thus preventing any data leakage and maintaining the integrity of our model evaluation through cross-validation.
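Before wiring it into the full pipeline, here is a minimal sketch of what this imputation does in isolation, on a toy column invented purely for illustration:

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data: one categorical column with a missing entry
toy = pd.DataFrame({'BsmtQual': ['Gd', 'TA', None, 'Ex']})

imputer = SimpleImputer(strategy='constant', fill_value='None')
print(imputer.fit_transform(toy))
# [['Gd'] ['TA'] ['None'] ['Ex']] -- the missing value becomes the string 'None'
```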
Here’s how we integrate this into our pipeline setup:
```python
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder
from sklearn.impute import SimpleImputer

# Load data
Ames = pd.read_csv('Ames.csv')
y = Ames['SalePrice']
linear_model = LinearRegression()

# Function to apply cubic transformation
def cubic_transformation(x):
    return x ** 3

# Function to create 'QualityArea'
def create_quality_area(X):
    X = X.copy()  # work on a copy so the caller's DataFrame is not mutated
    X['QualityArea'] = X['OverallQual'] * X['GrLivArea']
    return X[['QualityArea']].values

# Setup the FunctionTransformer for cubic and quality area transformations
cubic_transformer = FunctionTransformer(cubic_transformation)
quality_area_transformer = FunctionTransformer(create_quality_area)

# Prepare the BsmtQual imputation and encoding within a nested pipeline
bsmt_qual_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='None')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

# Setup ColumnTransformer for all preprocessing
preprocessor = ColumnTransformer(
    transformers=[
        ('cubic', cubic_transformer, ['OverallQual']),
        ('quality_area_transform', quality_area_transformer, ['OverallQual', 'GrLivArea']),
        ('onehot', OneHotEncoder(drop='first', handle_unknown='ignore'), ['Neighborhood', 'ExterQual', 'KitchenQual']),
        ('bsmt_qual', bsmt_qual_transformer, ['BsmtQual']),  # Adding BsmtQual handling
        ('passthrough', 'passthrough', ['YearBuilt'])
    ])

# Create the pipeline with the preprocessor and linear regression
pipeline_4 = Pipeline([
    ('preprocessor', preprocessor),
    ('regressor', linear_model)
])

# Evaluate the pipeline using 5-fold cross-validation
pipeline_score = cross_val_score(pipeline_4, Ames, y, cv=5).mean()

# Output the mean CV score rounded to three decimal places
print("Mean CV R² score with imputing & transformations: {:.3f}".format(pipeline_score))
```
The use of SimpleImputer in our pipeline helps efficiently handle missing data. When coupled with the rest of the preprocessing steps and the linear regression model, the complete setup allows us to evaluate the true impact of our preprocessing choices on model performance.
```
Mean CV R² score with imputing & transformations: 0.856
```
Here is a visual of our pipeline which includes missing data imputation:
This integration showcases the flexibility of sklearn pipelines and emphasizes how essential preprocessing steps, like imputation, are seamlessly included in the machine learning workflow, enhancing the model’s reliability and accuracy.
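As an aside, the same nested-pipeline pattern extends naturally to numeric columns. Here is a hedged sketch; the column name and imputation strategy are illustrative assumptions, not part of the example above:

```python
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric counterpart: fill gaps with the median, then scale
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# It would slot into the ColumnTransformer like any other entry, e.g.:
# ('numeric', numeric_transformer, ['LotFrontage'])
```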
Summary
In this post, we explored the utilization of sklearn pipelines, culminating in the sophisticated integration of data imputation for handling missing values within a linear regression context. We illustrated the seamless automation of data preprocessing steps, feature engineering, and the inclusion of advanced transformations to refine our model’s performance. The methodology highlighted in this post is not only about maintaining the workflow’s efficiency but also about ensuring the consistency and accuracy of the predictive models we aspire to build.
Specifically, you learned:
- The foundational concept of sklearn pipelines and how they encapsulate a sequence of data transformations and a final estimator.
- How feature engineering, when integrated into pipelines, can enhance model performance by creating new, more predictive features.
- The strategic use of SimpleImputer within pipelines to handle missing data effectively, preventing data leakage and improving model reliability.
Do you have any questions? Please ask your questions in the comments below, and I will do my best to answer.