Detecting and Overcoming Perfect Multicollinearity in Large Datasets


One of the significant challenges statisticians and data scientists face is multicollinearity, particularly its most severe form, perfect multicollinearity. This issue often lurks undetected in large datasets with many features, quietly skewing the results of statistical models.

In this post, we explore the methods for detecting, addressing, and refining models affected by perfect multicollinearity. Through practical analysis and examples, we aim to equip you with the tools necessary to enhance your models’ robustness and interpretability, ensuring that they deliver reliable insights and accurate predictions.

Let’s get started.

Photo by Ryan Stone. Some rights reserved.

Overview

This post is divided into three parts; they are:

  • Exploring the Impact of Perfect Multicollinearity on Linear Regression Models
  • Addressing Multicollinearity with Lasso Regression
  • Refining the Linear Regression Model Using Insights from Lasso Regression

Exploring the Impact of Perfect Multicollinearity on Linear Regression Models

Multiple linear regression is particularly valued for its interpretability: it allows a direct understanding of how each predictor impacts the response variable. However, its effectiveness hinges on the assumption that the features are linearly independent.

Collinearity means that one variable can be expressed as a linear combination of other variables, so the variables are not independent of one another.

Linear regression relies on the assumption that the feature set has no perfect collinearity. To check whether this assumption holds, a core concept from linear algebra, the rank of a matrix, is vital. In a regression setting, the rank reveals the linear independence of the features: no feature should be expressible as a direct linear combination of the others. When such dependencies exist, the rank is less than the number of features, and the result is perfect multicollinearity. This condition can distort the interpretability and reliability of a regression model, limiting its usefulness for making informed decisions.

Let’s explore this with the Ames Housing dataset. We will examine the dataset’s rank and the number of features to detect multicollinearity.
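The check itself is short. The sketch below shows one way it might look; treat the file name "Ames.csv", the restriction to complete numeric columns, and the exclusion of the "SalePrice" target as assumptions about the setup rather than a definitive recipe.

```python
import numpy as np
import pandas as pd

# Assumption: the dataset is available locally as "Ames.csv"
Ames = pd.read_csv("Ames.csv")

# One plausible preprocessing choice: keep numeric columns with no
# missing values and set the target aside
features = Ames.select_dtypes(include=[np.number]).dropna(axis=1)
features = features.drop(columns=["SalePrice"], errors="ignore")

print(f"Number of features: {features.shape[1]}")
print(f"Rank of the feature matrix: {np.linalg.matrix_rank(features.values)}")
```

If the rank comes back smaller than the number of columns, at least one feature is a linear combination of the others.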

Our preliminary results show that the Ames Housing dataset exhibits multicollinearity: it has 27 features but a matrix rank of only 26.

To address this, let’s identify the redundant features using a tailored function. This approach helps make informed decisions about feature selection or modifications to enhance model reliability and interpretability.
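One rank-based way such a function might be written is sketched below. The helper name `find_redundant_features` is hypothetical, and the snippet reuses the `features` frame from the previous example, so it is an illustration of the idea rather than the exact function used here.

```python
def find_redundant_features(df):
    """Flag columns that are linear combinations of the remaining columns."""
    full_rank = np.linalg.matrix_rank(df.values)
    redundant = []
    for col in df.columns:
        # If dropping a column does not lower the rank, that column adds
        # no new direction to the feature space and is therefore redundant
        if np.linalg.matrix_rank(df.drop(columns=col).values) == full_rank:
            redundant.append(col)
    return redundant

print(find_redundant_features(features))
```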

Several features are identified as redundant, indicating that they do not contribute uniquely to the predictive power of the model.

Having identified redundant features in our dataset, it is crucial to understand the nature of their redundancy. Specifically, we suspect that “GrLivArea” may simply be the sum of the first-floor area (“1stFlrSF”), second-floor area (“2ndFlrSF”), and low-quality finished square feet (“LowQualFinSF”). To verify this, we will calculate the total of these three areas and compare it directly with “GrLivArea” to confirm whether they are indeed identical.
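This comparison is only a few lines in pandas; the sketch below again assumes the `Ames` frame loaded earlier.

```python
# Check the suspected identity directly on every row
total_area = Ames["1stFlrSF"] + Ames["2ndFlrSF"] + Ames["LowQualFinSF"]
match_rate = (total_area == Ames["GrLivArea"]).mean()
print(f"GrLivArea equals 1stFlrSF + 2ndFlrSF + LowQualFinSF in {match_rate:.0%} of rows")
```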

Our analysis confirms that “GrLivArea” is precisely the sum of “1stFlrSF”, “2ndFlrSF”, and “LowQualFinSF” in 100% of the cases in the dataset.

Having established the redundancy of “GrLivArea” through matrix rank analysis, we now aim to visualize the effects of multicollinearity on our regression model’s stability and predictive power. The following steps will involve running a Multiple Linear Regression using the redundant features to observe the variance in coefficient estimates. This exercise will help demonstrate the practical impact of multicollinearity in a tangible way, reinforcing the need for careful feature selection in model building.
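A sketch of how this experiment might be set up is shown below. The five-fold split, the fixed random seed, and the pairing of a coefficient box plot with a box plot of fold R² scores are illustrative choices rather than the exact configuration behind the figures described next; the snippet reuses the `Ames` frame from earlier.

```python
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# The four features tied together by the perfect linear dependency
X = Ames[["GrLivArea", "1stFlrSF", "2ndFlrSF", "LowQualFinSF"]]
y = Ames["SalePrice"]

coefs, scores = [], []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=42).split(X):
    model = LinearRegression().fit(X.iloc[train_idx], y.iloc[train_idx])
    coefs.append(model.coef_)                                        # coefficient estimates per fold
    scores.append(model.score(X.iloc[test_idx], y.iloc[test_idx]))   # test R² per fold

coef_df = pd.DataFrame(coefs, columns=X.columns)

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
coef_df.boxplot(ax=axes[0])
axes[0].set_title("Coefficient estimates across folds")
axes[1].boxplot(scores)
axes[1].set_title("Test R² across folds")
plt.tight_layout()
plt.show()
```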

The results can be demonstrated with the two plots this produces.

The box plot on the left illustrates the substantial variance in the coefficient estimates. This significant spread in values not only points to the instability of our model but also directly challenges its interpretability. Multiple linear regression is particularly valued for its interpretability, which hinges on its coefficients’ stability and consistency. When coefficients vary widely from one data subset to another, it becomes difficult to derive clear and actionable insights, which are essential for making informed decisions based on the model’s predictions. Given these challenges, a more robust approach is needed to address the variability and instability in our model’s coefficients.

Addressing Multicollinearity with Lasso Regression

Lasso regression presents itself as a robust solution. Unlike multiple linear regression, Lasso can penalize the coefficients’ size and, crucially, set some coefficients to zero, effectively reducing the number of features in the model. This feature selection is particularly beneficial in mitigating multicollinearity. Let’s apply Lasso to our previous example to demonstrate this.

By varying the regularization strength (alpha), we can observe how increasing the penalty affects the coefficients and the predictive accuracy of the model:
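A sketch of this experiment is given below. It reuses `X`, `y`, and `KFold` from the previous snippet; the standardization step, the five-fold evaluation, and the specific alpha values of 1 and 2 (chosen to match the discussion that follows) are assumptions for illustration rather than a prescribed configuration.

```python
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

for alpha in [1, 2]:
    lasso_coefs, lasso_scores = [], []
    for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=42).split(X):
        # Standardize first so the L1 penalty treats every feature on the same scale
        model = make_pipeline(StandardScaler(), Lasso(alpha=alpha, max_iter=20000))
        model.fit(X.iloc[train_idx], y.iloc[train_idx])
        lasso_coefs.append(model.named_steps["lasso"].coef_)
        lasso_scores.append(model.score(X.iloc[test_idx], y.iloc[test_idx]))
    print(f"alpha={alpha}: mean fold R² = {np.mean(lasso_scores):.4f}")
    print(pd.DataFrame(lasso_coefs, columns=X.columns).round(1))
```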

 

The box plots on the left show that as alpha increases, the spread and magnitude of the coefficients decrease, indicating more stable estimates. Notably, the coefficient for ‘2ndFlrSF’ begins to approach zero as alpha is set to 1 and is virtually zero when alpha increases to 2. This trend suggests that ‘2ndFlrSF’ contributes minimally to the model as the regularization strength is heightened, indicating that it may be redundant or collinear with other features in the model. This stabilization is a direct result of Lasso’s ability to reduce the influence of less important features, which are likely contributing to multicollinearity.

The fact that ‘2ndFlrSF’ can be removed with minimal impact on the model’s predictability is significant. It underscores the efficiency of Lasso in identifying and eliminating unnecessary predictors. Importantly, the overall predictability of the model remains unchanged even as this feature is effectively zeroed out, demonstrating the robustness of Lasso in maintaining model performance while simplifying its complexity.

Refining the Linear Regression Model Using Insights from Lasso Regression

Following the insights gained from the Lasso regression, we have refined our model by removing ‘2ndFlrSF’, a feature identified as contributing minimally to the predictive power. This section evaluates the performance and stability of the coefficients in the revised model, using only ‘GrLivArea’, ‘1stFlrSF’, and ‘LowQualFinSF’.
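The refined experiment mirrors the earlier one with “2ndFlrSF” dropped; a minimal sketch, reusing the objects defined above, might look like this:

```python
# Refit the cross-validated regression without the redundant "2ndFlrSF"
X_refined = Ames[["GrLivArea", "1stFlrSF", "LowQualFinSF"]]

refined_coefs, refined_scores = [], []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=42).split(X_refined):
    model = LinearRegression().fit(X_refined.iloc[train_idx], y.iloc[train_idx])
    refined_coefs.append(model.coef_)
    refined_scores.append(model.score(X_refined.iloc[test_idx], y.iloc[test_idx]))

print(pd.DataFrame(refined_coefs, columns=X_refined.columns))
print(f"Mean fold R²: {np.mean(refined_scores):.4f}")
```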

The results of our refined multiple regression model can be demonstrated with the same pair of plots.

The box plot on the left illustrates the coefficients’ distribution across different folds of cross-validation. Notably, the variance in the coefficients appears reduced compared to previous models that included “2ndFlrSF.” This reduction in variability highlights the effectiveness of removing redundant features, which can help stabilize the model’s estimates and enhance its interpretability. Each feature’s coefficient now exhibits less fluctuation, suggesting that the model can consistently evaluate the importance of these features across various subsets of the data.

In addition to maintaining the model’s predictability, the reduction in feature complexity has significantly enhanced the interpretability of the model. With fewer variables, each contributing distinctly to the outcome, we can now more easily gauge the impact of these specific features on the sale price. This clarity allows for more straightforward interpretations and more confident decision-making based on the model’s output. Stakeholders can better understand how changes in “GrLivArea”, “1stFlrSF”, and “LowQualFinSF” are likely to affect property values, facilitating clearer communication and more actionable insights. This improved transparency is invaluable, particularly in fields where explaining model predictions is as important as the predictions themselves.

Summary

This blog post tackled the challenge of perfect multicollinearity in regression models, starting with its detection via matrix rank analysis on the Ames Housing dataset. We then explored Lasso regression as a way to mitigate multicollinearity by reducing the feature count, stabilizing coefficient estimates, and preserving model predictability. Finally, we refined the linear regression model based on the Lasso results, enhancing its interpretability and reliability through strategic feature reduction.

Specifically, you learned:

  • The use of matrix rank analysis to detect perfect multicollinearity in a dataset.
  • The application of Lasso regression to mitigate multicollinearity and assist in feature selection.
  • The refinement of a linear regression model using insights from Lasso to enhance interpretability.

Do you have any questions? Please ask your questions in the comments below, and I will do my best to answer.

