One of the significant challenges statisticians and data scientists face is multicollinearity, particularly its most severe form, perfect multicollinearity. This issue often lurks undetected in large datasets with many features, potentially disguising itself and skewing the results of statistical models.
In this post, we explore the methods for detecting, addressing, and refining models affected by perfect multicollinearity. Through practical analysis and examples, we aim to equip you with the tools necessary to enhance your models’ robustness and interpretability, ensuring that they deliver reliable insights and accurate predictions.
Let’s get started.
Overview
This post is divided into three parts; they are:
- Exploring the Impact of Perfect Multicollinearity on Linear Regression Models
- Addressing Multicollinearity with Lasso Regression
- Refining the Linear Regression Model Using Insights from Lasso Regression
Exploring the Impact of Perfect Multicollinearity on Linear Regression Models
Multiple linear regression is particularly valued for its interpretability. It allows a direct understanding of how each predictor impacts the response variable. However, its effectiveness hinges on the assumption of independent features.
Collinearity means that one variable can be expressed as a linear combination of other variables, so the features are not independent of one another.
Linear regression works under the assumption that the feature set has no collinearity. To ensure this assumption holds, understanding a core concept in linear algebra—the rank of a matrix—is vital. In linear regression, the rank reveals the linear independence of features. Essentially, no feature should be a direct linear combination of another. This independence is crucial because dependencies among features—where the rank is less than the number of features—lead to perfect multicollinearity. This condition can distort the interpretability and reliability of a regression model, impacting its utility in making informed decisions.
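Before turning to a real dataset, here is a minimal, self-contained sketch (using a small made-up matrix, not the housing data) of how the rank exposes a hidden linear combination: the third column below is the sum of the first two, so NumPy reports a rank of 2 even though the matrix has 3 columns.

# Toy example: the third column is the sum of the first two
import numpy as np

toy = np.array([
    [1.0, 2.0, 3.0],
    [4.0, 1.0, 5.0],
    [2.0, 2.0, 4.0],
    [3.0, 5.0, 8.0],
])

# Rank is 2, not 3, because the columns are linearly dependent
print(np.linalg.matrix_rank(toy))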
Let’s explore this with the Ames Housing dataset. We will examine the dataset’s rank and the number of features to detect multicollinearity.
# Import necessary libraries to check and compare number of columns vs rank of dataset
import pandas as pd
import numpy as np

# Load the dataset
Ames = pd.read_csv('Ames.csv')

# Select numerical columns without missing values
numerical_data = Ames.select_dtypes(include=[np.number]).dropna(axis=1)

# Calculate the matrix rank
rank = np.linalg.matrix_rank(numerical_data.values)

# Number of features
num_features = numerical_data.shape[1]

# Print the rank and the number of features
print(f"Numerical features without missing values: {num_features}")
print(f"Rank: {rank}")
Our preliminary results show that the Ames Housing dataset has multicollinearity, with 27 features but only a rank of 26:
Numerical features without missing values: 27
Rank: 26
To address this, let’s identify the redundant features using a tailored function. This approach helps make informed decisions about feature selection or modifications to enhance model reliability and interpretability.
# Creating and using a function to identify redundant features in a dataset
import pandas as pd
import numpy as np

def find_redundant_features(data):
    """
    Identifies and returns redundant features in a dataset based on matrix rank.
    A feature is considered redundant if removing it does not decrease the rank
    of the dataset, indicating that it can be expressed as a linear combination
    of other features.

    Parameters:
        data (DataFrame): The numerical dataset to analyze.

    Returns:
        list: A list of redundant feature names.
    """
    # Calculate the matrix rank of the original dataset
    original_rank = np.linalg.matrix_rank(data)
    redundant_features = []

    for column in data.columns:
        # Create a new dataset without this column
        temp_data = data.drop(column, axis=1)
        # Calculate the rank of the new dataset
        temp_rank = np.linalg.matrix_rank(temp_data)

        # If the rank does not decrease, the removed column is redundant
        if temp_rank == original_rank:
            redundant_features.append(column)

    return redundant_features

# Usage of the function with the numerical data
Ames = pd.read_csv('Ames.csv')
numerical_data = Ames.select_dtypes(include=[np.number]).dropna(axis=1)
redundant_features = find_redundant_features(numerical_data)
print("Redundant features:", redundant_features)
The following features have been identified as redundant, indicating that they do not contribute uniquely to the predictive power of the model:
Redundant features: ['GrLivArea', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF']
Having identified redundant features in our dataset, it is crucial to understand the nature of their redundancy. Specifically, we suspect that "GrLivArea" may simply be the sum of the first-floor area ("1stFlrSF"), the second-floor area ("2ndFlrSF"), and the low-quality finished square feet ("LowQualFinSF"). To verify this, we will calculate the total of these three areas and compare it directly with "GrLivArea" to confirm whether they are indeed identical.
# Import pandas
import pandas as pd

# Load the dataset
Ames = pd.read_csv('Ames.csv')

# Calculate the sum of '1stFlrSF', '2ndFlrSF', and 'LowQualFinSF'
Ames['CalculatedGrLivArea'] = Ames['1stFlrSF'] + Ames['2ndFlrSF'] + Ames['LowQualFinSF']

# Compare the calculated sum with the existing 'GrLivArea' column to see if they are the same
Ames['IsEqual'] = Ames['GrLivArea'] == Ames['CalculatedGrLivArea']

# Output the percentage of rows where the values match
match_percentage = Ames['IsEqual'].mean() * 100
print(f"Percentage of rows where GrLivArea equals the sum of the other three features: {int(match_percentage)}%")
Our analysis confirms that “GrLivArea” is precisely the sum of “1stFlrSF”, “2ndFlrSF”, and “LowQualFinSF” in 100% of the cases in the dataset:
Percentage of rows where GrLivArea equals the sum of the other three features: 100%
Having established the redundancy of “GrLivArea” through matrix rank analysis, we now aim to visualize the effects of multicollinearity on our regression model’s stability and predictive power. The following steps will involve running a Multiple Linear Regression using the redundant features to observe the variance in coefficient estimates. This exercise will help demonstrate the practical impact of multicollinearity in a tangible way, reinforcing the need for careful feature selection in model building.
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
import matplotlib.pyplot as plt

# Load the data
Ames = pd.read_csv('Ames.csv')
features = ['GrLivArea', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF']
X = Ames[features]
y = Ames['SalePrice']

# Initialize a K-Fold cross-validation
kf = KFold(n_splits=5, shuffle=True, random_state=1)

# Collect coefficients and CV scores
coefficients = []
cv_scores = []

for train_index, test_index in kf.split(X):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    # Initialize and fit the linear regression model
    model = LinearRegression()
    model.fit(X_train, y_train)
    coefficients.append(model.coef_)

    # Calculate R^2 score using the model's score method
    score = model.score(X_test, y_test)
    cv_scores.append(score)

# Plotting the coefficients
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.boxplot(np.array(coefficients), labels=features)
plt.title('Box Plot of Coefficients Across Folds (MLR)')
plt.xlabel('Features')
plt.ylabel('Coefficient Value')
plt.grid(True)

# Plotting the CV scores
plt.subplot(1, 2, 2)
plt.plot(range(1, 6), cv_scores, marker='o', linestyle='-')  # x-axis starts from 1
plt.title('Cross-Validation R² Scores (MLR)')
plt.xlabel('Fold')
plt.xticks(range(1, 6))  # Set x-ticks to match fold numbers
plt.ylabel('R² Score')
plt.ylim(min(cv_scores) - 0.05, max(cv_scores) + 0.05)  # Dynamically adjust y-axis limits
plt.grid(True)

# Annotate mean R² score
mean_r2 = np.mean(cv_scores)
plt.annotate(f'Mean CV R²: {mean_r2:.3f}', xy=(1.25, 0.65), color='red', fontsize=14)

plt.tight_layout()
plt.show()
The results can be demonstrated with the two plots below:
The box plot on the left illustrates the substantial variance in the coefficient estimates. This significant spread in values not only points to the instability of our model but also directly challenges its interpretability. Multiple linear regression is particularly valued for its interpretability, which hinges on its coefficients’ stability and consistency. When coefficients vary widely from one data subset to another, it becomes difficult to derive clear and actionable insights, which are essential for making informed decisions based on the model’s predictions. Given these challenges, a more robust approach is needed to address the variability and instability in our model’s coefficients.
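If you prefer a number to a picture, a short follow-up (a sketch that assumes the cross-validation block above has just been run, so its coefficients and features variables still exist) summarizes how much each coefficient moves from fold to fold:

# Quantify the fold-to-fold variation of each coefficient
# (assumes `coefficients` and `features` from the MLR cross-validation above)
coef_array = np.array(coefficients)  # shape: (5 folds, 4 features)
for name, mean_c, std_c in zip(features, coef_array.mean(axis=0), coef_array.std(axis=0)):
    print(f"{name}: mean = {mean_c:.2f}, std across folds = {std_c:.2f}")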
Addressing Multicollinearity with Lasso Regression
Lasso regression presents itself as a robust solution. Unlike multiple linear regression, Lasso can penalize the coefficients’ size and, crucially, set some coefficients to zero, effectively reducing the number of features in the model. This feature selection is particularly beneficial in mitigating multicollinearity. Let’s apply Lasso to our previous example to demonstrate this.
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import KFold
import matplotlib.pyplot as plt

# Load the data
Ames = pd.read_csv('Ames.csv')
features = ['GrLivArea', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF']
X = Ames[features]
y = Ames['SalePrice']

# Initialize a K-Fold cross-validation
kf = KFold(n_splits=5, shuffle=True, random_state=1)

# Prepare to collect results
results = {}

for alpha in [1, 2]:  # Loop through both alpha values
    coefficients = []
    cv_scores = []

    for train_index, test_index in kf.split(X):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]

        # Scale features
        scaler = StandardScaler()
        X_train_scaled = scaler.fit_transform(X_train)
        X_test_scaled = scaler.transform(X_test)

        # Initialize and fit the Lasso regression model
        lasso_model = Lasso(alpha=alpha, max_iter=20000)
        lasso_model.fit(X_train_scaled, y_train)
        coefficients.append(lasso_model.coef_)

        # Calculate R^2 score using the model's score method
        score = lasso_model.score(X_test_scaled, y_test)
        cv_scores.append(score)

    results[alpha] = (coefficients, cv_scores)

# Plotting the results
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(12, 12))
alphas = [1, 2]

for i, alpha in enumerate(alphas):
    coefficients, cv_scores = results[alpha]

    # Plotting the coefficients
    axes[i, 0].boxplot(np.array(coefficients), labels=features)
    axes[i, 0].set_title(f'Box Plot of Coefficients (Lasso with alpha={alpha})')
    axes[i, 0].set_xlabel('Features')
    axes[i, 0].set_ylabel('Coefficient Value')
    axes[i, 0].grid(True)

    # Plotting the CV scores
    axes[i, 1].plot(range(1, 6), cv_scores, marker='o', linestyle='-')
    axes[i, 1].set_title(f'Cross-Validation R² Scores (Lasso with alpha={alpha})')
    axes[i, 1].set_xlabel('Fold')
    axes[i, 1].set_xticks(range(1, 6))
    axes[i, 1].set_ylabel('R² Score')
    axes[i, 1].set_ylim(min(cv_scores) - 0.05, max(cv_scores) + 0.05)
    axes[i, 1].grid(True)
    mean_r2 = np.mean(cv_scores)
    axes[i, 1].annotate(f'Mean CV R²: {mean_r2:.3f}', xy=(1.25, 0.65), color='red', fontsize=12)

plt.tight_layout()
plt.show()
By varying the regularization strength (alpha), we can observe how increasing the penalty affects the coefficients and the predictive accuracy of the model:
The box plots on the left show that as alpha increases, the spread and magnitude of the coefficients decrease, indicating more stable estimates. Notably, the coefficient for ‘2ndFlrSF’ begins to approach zero as alpha is set to 1 and is virtually zero when alpha increases to 2. This trend suggests that ‘2ndFlrSF’ contributes minimally to the model as the regularization strength is heightened, indicating that it may be redundant or collinear with other features in the model. This stabilization is a direct result of Lasso’s ability to reduce the influence of less important features, which are likely contributing to multicollinearity.
The fact that ‘2ndFlrSF’ can be removed with minimal impact on the model’s predictability is significant. It underscores the efficiency of Lasso in identifying and eliminating unnecessary predictors. Importantly, the overall predictability of the model remains unchanged even as this feature is effectively zeroed out, demonstrating the robustness of Lasso in maintaining model performance while simplifying its complexity.
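As a quick sanity check, not part of the original experiment, you can fit a single Lasso model with alpha=2 on the full, standardized feature set and list its coefficients; the exact numbers depend on the data, but the '2ndFlrSF' coefficient should be driven toward zero, consistent with the box plots above.

# Fit one Lasso model on all rows (scaled) and inspect which coefficients survive
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

Ames = pd.read_csv('Ames.csv')
features = ['GrLivArea', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF']
X_scaled = StandardScaler().fit_transform(Ames[features])
y = Ames['SalePrice']

lasso = Lasso(alpha=2, max_iter=20000).fit(X_scaled, y)
for name, coef in zip(features, lasso.coef_):
    print(f"{name}: {coef:.2f}")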
Refining the Linear Regression Model Using Insights from Lasso Regression
Following the insights gained from the Lasso regression, we have refined our model by removing ‘2ndFlrSF’, a feature identified as contributing minimally to the predictive power. This section evaluates the performance and stability of the coefficients in the revised model, using only ‘GrLivArea’, ‘1stFlrSF’, and ‘LowQualFinSF’.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Load the data
Ames = pd.read_csv('Ames.csv')
features = ['GrLivArea', '1stFlrSF', 'LowQualFinSF']  # Remove '2ndFlrSF' after running Lasso
X = Ames[features]
y = Ames['SalePrice']

# Initialize a K-Fold cross-validation
kf = KFold(n_splits=5, shuffle=True, random_state=1)

# Collect coefficients and CV scores
coefficients = []
cv_scores = []

for train_index, test_index in kf.split(X):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    # Initialize and fit the linear regression model
    model = LinearRegression()
    model.fit(X_train, y_train)
    coefficients.append(model.coef_)

    # Calculate R^2 score using the model's score method
    score = model.score(X_test, y_test)
    cv_scores.append(score)

# Plotting the coefficients
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.boxplot(np.array(coefficients), labels=features)
plt.title('Box Plot of Coefficients Across Folds (MLR)')
plt.xlabel('Features')
plt.ylabel('Coefficient Value')
plt.grid(True)

# Plotting the CV scores
plt.subplot(1, 2, 2)
plt.plot(range(1, 6), cv_scores, marker='o', linestyle='-')  # x-axis starts from 1
plt.title('Cross-Validation R² Scores (MLR)')
plt.xlabel('Fold')
plt.xticks(range(1, 6))  # Set x-ticks to match fold numbers
plt.ylabel('R² Score')
plt.ylim(min(cv_scores) - 0.05, max(cv_scores) + 0.05)  # Dynamically adjust y-axis limits
plt.grid(True)

# Annotate mean R² score
mean_r2 = np.mean(cv_scores)
plt.annotate(f'Mean CV R²: {mean_r2:.3f}', xy=(1.25, 0.65), color='red', fontsize=14)

plt.tight_layout()
plt.show()
The results of our refined multiple regression model can be demonstrated with the two plots below:
The box plot on the left illustrates the coefficients’ distribution across different folds of cross-validation. Notably, the variance in the coefficients appears reduced compared to previous models that included “2ndFlrSF.” This reduction in variability highlights the effectiveness of removing redundant features, which can help stabilize the model’s estimates and enhance its interpretability. Each feature’s coefficient now exhibits less fluctuation, suggesting that the model can consistently evaluate the importance of these features across various subsets of the data.
In addition to maintaining the model’s predictability, the reduction in feature complexity has significantly enhanced the interpretability of the model. With fewer variables, each contributing distinctly to the outcome, we can now more easily gauge the impact of these specific features on the sale price. This clarity allows for more straightforward interpretations and more confident decision-making based on the model’s output. Stakeholders can better understand how changes in “GrLivArea”, “1stFlrSF”, and “LowQualFinSF” are likely to affect property values, facilitating clearer communication and more actionable insights. This improved transparency is invaluable, particularly in fields where explaining model predictions is as important as the predictions themselves.
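For example, fitting the refined model on the full dataset and printing one coefficient per feature (a minimal sketch reusing the same feature names; the exact values will depend on the data) gives a direct reading of the estimated change in sale price per additional square foot of each area:

# Fit the refined model on all rows and report one coefficient per remaining feature
import pandas as pd
from sklearn.linear_model import LinearRegression

Ames = pd.read_csv('Ames.csv')
features = ['GrLivArea', '1stFlrSF', 'LowQualFinSF']
model = LinearRegression().fit(Ames[features], Ames['SalePrice'])

print(f"Intercept: {model.intercept_:.2f}")
for name, coef in zip(features, model.coef_):
    print(f"{name}: {coef:.2f} per additional square foot")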
Summary
This blog post tackled the challenge of perfect multicollinearity in regression models, starting with its detection using matrix rank analysis in the Ames Housing dataset. We then explored Lasso regression to mitigate multicollinearity by reducing feature count, stabilizing coefficient estimates, and preserving model predictability. It concluded by refining the linear regression model and enhancing its interpretability and reliability through strategic feature reduction.
Specifically, you learned:
- The use of matrix rank analysis to detect perfect multicollinearity in a dataset.
- The application of Lasso regression to mitigate multicollinearity and assist in feature selection.
- The refinement of a linear regression model using insights from Lasso to enhance interpretability.
Do you have any questions? Please ask your questions in the comments below, and I will do my best to answer.