To understand housing prices better, simplicity and clarity in our models are key. Our aim with this post is to demonstrate how straightforward yet powerful techniques in feature selection and engineering can lead to an effective, simple linear regression model. Working with the Ames dataset, we use a Sequential Feature Selector (SFS) to identify the most impactful numeric features and then enhance our model’s accuracy through thoughtful feature engineering.
Let’s get started.
Overview
This post is divided into three parts; they are:
- Identifying the Most Predictive Numeric Feature
- Evaluating Individual Features’ Predictive Power
- Enhancing Predictive Accuracy with Feature Engineering
Identifying the Most Predictive Numeric Feature
In the initial segment of our exploration, we embark on a mission to identify the most predictive numeric feature within the Ames dataset. This is achieved by applying the Sequential Feature Selector (SFS), a tool designed to sift through features and select the one that maximizes our model’s predictive accuracy. The process is straightforward, focusing solely on numeric columns and excluding any with missing values to ensure a clean and robust analysis:
# Load only the numeric columns from the Ames dataset
import pandas as pd
Ames = pd.read_csv('Ames.csv').select_dtypes(include=['int64', 'float64'])

# Drop any columns with missing values
Ames = Ames.dropna(axis=1)

# Import Linear Regression and Sequential Feature Selector from scikit-learn
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SequentialFeatureSelector

# Initialize the Linear Regression model
model = LinearRegression()

# Perform Sequential Feature Selection
sfs = SequentialFeatureSelector(model, n_features_to_select=1)
X = Ames.drop('SalePrice', axis=1)   # Features
y = Ames['SalePrice']                # Target variable
sfs.fit(X, y)                        # Uses a default of cv=5
selected_feature = X.columns[sfs.get_support()]
print("Feature selected for highest predictability:", selected_feature[0])
This will output:
Feature selected for highest predictability: OverallQual
This result notably challenges the initial presumption that living area might be the most predictive feature for housing prices. Instead, it underscores the significance of overall quality, suggesting that, contrary to initial expectations, quality is the paramount consideration for buyers. It is important to note that the Sequential Feature Selector utilizes cross-validation with a default of five folds (cv=5) to evaluate the performance of each candidate feature. This approach ensures that the selected feature, reflected by the highest mean cross-validation R² score, is robust and likely to generalize well to unseen data.
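The default behaves exactly as if cv=5 were passed explicitly. As a quick robustness check, an extra step not part of the original walkthrough, you can rerun the selector with a different fold count and confirm whether the same feature wins. A minimal sketch, building on the variables defined above:

# A minimal sketch: rerun the selection with ten folds as a robustness check
sfs_10 = SequentialFeatureSelector(model, n_features_to_select=1, cv=10)
sfs_10.fit(X, y)
print("Feature selected with cv=10:", X.columns[sfs_10.get_support()][0])

If the same column is printed, the selection is stable across fold counts.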
Evaluating Individual Features’ Predictive Power
Building upon our initial findings, we delve deeper to rank features by their predictive capabilities. Employing five-fold cross-validation, we evaluate each feature independently and use its mean R² score to ascertain its individual contribution to the model’s accuracy.
# Building on the earlier block of code:
from sklearn.model_selection import cross_val_score

# Dictionary to hold feature names and their corresponding mean CV R² scores
feature_scores = {}

# Iterate over each feature, perform CV, and store the mean R² score
for feature in X.columns:
    X_single = X[[feature]]
    cv_scores = cross_val_score(model, X_single, y, cv=5)
    feature_scores[feature] = cv_scores.mean()

# Sort features based on their mean CV R² scores in descending order
sorted_features = sorted(feature_scores.items(), key=lambda item: item[1], reverse=True)

# Print the top 3 features and their scores
top_3 = sorted_features[0:3]
for feature, score in top_3:
    print(f"Feature: {feature}, Mean CV R²: {score:.4f}")
This will output:
Feature: OverallQual, Mean CV R²: 0.6183
Feature: GrLivArea, Mean CV R²: 0.5127
Feature: 1stFlrSF, Mean CV R²: 0.3957
These findings underline the key role of overall quality (“OverallQual”), as well as the importance of living area (“GrLivArea”) and first-floor space (“1stFlrSF”) in the context of housing price predictions.
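If you would like to inspect more than the top three, one optional way is to chart the ranking computed above. This sketch assumes matplotlib is installed; it is not used elsewhere in this post:

# An optional sketch, assuming matplotlib is available:
# visualize the ten highest-scoring features from sorted_features
import matplotlib.pyplot as plt

names, scores = zip(*sorted_features[:10])
plt.barh(names[::-1], scores[::-1])  # reverse so the best feature sits on top
plt.xlabel('Mean CV R²')
plt.title('Top 10 features by individual CV R²')
plt.tight_layout()
plt.show()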
Enhancing Predictive Accuracy with Feature Engineering
In the final stride of our journey, we employ feature engineering to create a novel feature, “Quality Weighted Area,” by multiplying ‘OverallQual’ by ‘GrLivArea’ (stored as QualityArea in the code below). This fusion aims to synthesize a more powerful predictor, encapsulating both the quality and size dimensions of a property.
# Building on the earlier blocks of code:
Ames['QualityArea'] = Ames['OverallQual'] * Ames['GrLivArea']

# Setting up the feature and target variable for the new 'QualityArea' feature
X = Ames[['QualityArea']]   # New feature
y = Ames['SalePrice']

# 5-Fold CV on Linear Regression
model = LinearRegression()
cv_scores = cross_val_score(model, X, y, cv=5)

# Calculating the mean of the CV scores
mean_cv_score = cv_scores.mean()
print(f"Mean CV R² score using 'Quality Weighted Area': {mean_cv_score:.4f}")
This will output:
Mean CV R² score using ‘Quality Weighted Area’: 0.7484
This remarkable increase in R² score vividly demonstrates the efficacy of combining features to capture more nuanced aspects of data, providing a compelling case for the thoughtful application of feature engineering in predictive modeling.
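A natural follow-up question is whether the interaction term beats simply handing the model both raw features. The sketch below, an extra check rather than part of the original walkthrough, computes that two-feature baseline so you can compare the scores on your own run:

# An extra check, building on the code above: score a model that uses
# OverallQual and GrLivArea as two separate features, as a baseline
# for judging the engineered interaction term
X_two = Ames[['OverallQual', 'GrLivArea']]
baseline_scores = cross_val_score(LinearRegression(), X_two, y, cv=5)
print(f"Mean CV R² using both raw features: {baseline_scores.mean():.4f}")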
Further Reading
APIs
Tutorials
- Ames Housing Dataset & Data Dictionary
Summary
Through this three-part exploration, you have navigated the process of pinpointing and enhancing predictors for housing price predictions with an emphasis on simplicity. Using a Sequential Feature Selector (SFS), we identified overall quality as the key predictor, a crucial first step, especially since our goal was to create the best simple linear regression model, which led us to exclude categorical features for a streamlined analysis. From there, we evaluated the individual impacts of living area and first-floor space, and creating “Quality Weighted Area,” a feature blending quality with size, notably enhanced our model’s accuracy. The journey through feature selection and engineering underscored the power of simplicity in improving real estate predictive models, offering deeper insights into what truly influences housing prices, and it emphasizes that with the right techniques, even simple models can yield profound insights into complex datasets like Ames’ housing prices.
Specifically, you learned:
- The value of Sequential Feature Selection in revealing the most important predictors for housing prices.
- The importance of quality over size when predicting housing prices in Ames, Iowa.
- How merging features into a “Quality Weighted Area” enhances model accuracy.
Do you have experiences with feature selection or engineering you would like to share, or questions about the process? Please ask your questions or give us feedback in the comments below, and I will do my best to answer.