Statistics and Machine Learning both aim to extract insights from data, though their approaches differ significantly. Traditional statistics primarily concerns itself with inference, using the entire dataset to test hypotheses and draw conclusions about a larger population. In contrast, machine learning emphasizes prediction and decision-making, typically employing a train-test split methodology where models learn from a portion of the data (the training set) and validate their predictions on unseen data (the testing set).
In this post, we will demonstrate how a seemingly straightforward technique like linear regression can be viewed through these two lenses. We will explore their unique contributions by using Scikit-Learn for machine learning and Statsmodels for statistical inference.
Let’s get started.
Overview
This post is divided into three parts; they are:
- Supervised Learning: Classification vs. Regression
- Diving into Regression with a Machine Learning Focus
- Enhancing Understanding with Statistical Insights
Supervised Learning: Classification vs. Regression
Supervised learning is a branch of machine learning where the model is trained on a labeled dataset. This means that each example in the training dataset is paired with the correct output. Once trained, the model can apply what it has learned to new, unseen data.
In supervised learning, we encounter two main tasks: classification and regression. These tasks are determined by the type of output we aim to predict. If the goal is to predict categories, such as determining if an email is spam, we are dealing with a classification task. Alternatively, if we estimate a value, such as calculating the miles per gallon (MPG) a car will achieve based on its features, it falls under regression. The output’s nature — a category or a number — steers us toward the appropriate approach.
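To make the distinction concrete, here is a minimal sketch using scikit-learn. The toy data and feature choices below are invented purely for illustration and are not drawn from any real dataset: a classifier such as LogisticRegression predicts a category, while a regressor such as LinearRegression predicts a number.

```python
# A minimal sketch of the two supervised learning task types.
# The toy data below is invented purely for illustration.
from sklearn.linear_model import LinearRegression, LogisticRegression

# Classification: the target is a category (1 = spam, 0 = not spam)
X_cls = [[120, 3], [40, 0], [250, 7], [10, 0]]   # e.g., word count, number of links
y_cls = [1, 0, 1, 0]
clf = LogisticRegression().fit(X_cls, y_cls)
print(clf.predict([[200, 5]]))    # predicts a class label (0 or 1)

# Regression: the target is a number (miles per gallon)
X_reg = [[2.0, 150], [1.4, 90], [3.5, 280]]      # e.g., engine size, horsepower
y_reg = [28.0, 35.5, 19.0]
reg = LinearRegression().fit(X_reg, y_reg)
print(reg.predict([[2.5, 180]]))  # predicts a continuous value
```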
In this series, we will use the Ames housing dataset. It provides a comprehensive collection of features related to houses, including architectural details, condition, and location, aimed at predicting the “SalePrice” (the sales price) of each house.
```python
# Load the Ames dataset
import pandas as pd

Ames = pd.read_csv("Ames.csv")

# Display the first few rows of the dataset and the data type of "SalePrice"
print(Ames.head())

sale_price_dtype = Ames["SalePrice"].dtype
print(f"The data type of 'SalePrice' is {sale_price_dtype}.")
```
This should output:
```
         PID  GrLivArea  SalePrice  ...          Prop_Addr   Latitude  Longitude
0  909176150        856     126000  ...    436 HAYWARD AVE  42.018564 -93.651619
1  905476230       1049     139500  ...       3416 WEST ST  42.024855 -93.663671
2  911128020       1001     124900  ...       320 S 2ND ST  42.021548 -93.614068
3  535377150       1039     114000  ...   1524 DOUGLAS AVE  42.037391 -93.612207
4  534177230       1665     227000  ...  2304 FILLMORE AVE  42.044554 -93.631818

[5 rows x 85 columns]

The data type of 'SalePrice' is int64.
```
The “SalePrice” column is of data type int64, indicating that it holds integer values. Since “SalePrice” is a numerical (continuous) variable rather than a categorical one, predicting it is a regression task. In other words, the goal is to predict a continuous quantity (the sale price of a house) based on the input features provided in the dataset.
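As a quick programmatic check, here is a small sketch of one rough heuristic: a numeric target with many distinct values points to regression, while an object or low-cardinality target points to classification. The cutoff of 20 distinct values below is an arbitrary illustrative choice, not a rule taken from the dataset.

```python
# A rough heuristic for inferring the task type from the target column.
# Assumes the Ames DataFrame loaded earlier; the cutoff of 20 is arbitrary.
import pandas as pd

target = Ames["SalePrice"]
if pd.api.types.is_numeric_dtype(target) and target.nunique() > 20:
    print("Continuous numeric target -> regression task")
else:
    print("Categorical or low-cardinality target -> classification task")
```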
Diving into Regression with a Machine Learning Focus
Supervised learning in machine learning focuses on predicting outcomes based on input data. In our case, using the Ames Housing dataset, we aim to predict a house’s sale price from its living area—a classic regression task. For this, we turn to scikit-learn, renowned for its simplicity and effectiveness in building predictive models.
To start, we select “GrLivArea” (ground living area) as our feature and “SalePrice” as the target. The next step involves splitting our dataset into training and testing sets using scikit-learn’s train_test_split() function. This crucial step allows us to train our model on one set of data and evaluate its performance on another, which gives a more honest picture of how the model will behave on data it has never seen.
Here’s how we do it:
```python
# Import Linear Regression from scikit-learn
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Select features and target
X = Ames[["GrLivArea"]]   # Feature: GrLivArea, 2D matrix
y = Ames["SalePrice"]     # Target: SalePrice, 1D vector

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and fit the Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Score the model on the test set
score = round(model.score(X_test, y_test), 4)
print(f"Model R^2 Score: {score}")
```
This should output:

```
Model R^2 Score: 0.4789
```
The LinearRegression object imported in the code above is scikit-learn’s implementation of linear regression. The model’s R² score of 0.4789 indicates that living area alone explains approximately 48% of the variation in sale prices, a significant insight for such a simple model. This step marks our initial foray into machine learning with scikit-learn, showcasing the ease with which we can assess model performance on unseen (test) data.
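If you want to see what model.score() computes for a regressor, the same value can be reproduced with sklearn.metrics.r2_score on the test-set predictions; a minimal sketch:

```python
# For regressors, model.score() returns R^2, so r2_score on the predictions matches it.
from sklearn.metrics import r2_score

y_pred = model.predict(X_test)
print(f"R^2 via r2_score: {round(r2_score(y_test, y_pred), 4)}")  # should match the score above
```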
Enhancing Understanding with Statistical Insights
After exploring how scikit-learn can help us assess model performance on unseen data, we now turn our attention to statsmodels, a Python package that offers a different angle of analysis. While scikit-learn excels at building models and predicting outcomes, statsmodels shines by diving deep into the statistical aspects of our data and model. Let’s see how statsmodels can provide insight at a different level:
```python
# Import statsmodels
import statsmodels.api as sm

# Add a constant to our independent variable for the intercept
X_with_constant = sm.add_constant(X)

# Fit the OLS model
model_stats = sm.OLS(y, X_with_constant).fit()

# Print the summary of the model
print(model_stats.summary())
```
The first key distinction to highlight is statsmodels’ use of all observations in our dataset. Unlike the predictive modeling approach, where we split the data into training and testing sets, statsmodels leverages the entire dataset to provide comprehensive statistical insights. This full utilization of the data allows for a detailed understanding of the relationships between variables and improves the precision of our statistical estimates. The above code should output the following:
```
                            OLS Regression Results
==============================================================================
Dep. Variable:              SalePrice   R-squared:                       0.518
Model:                            OLS   Adj. R-squared:                  0.518
Method:                 Least Squares   F-statistic:                     2774.
Date:                Sun, 31 Mar 2024   Prob (F-statistic):               0.00
Time:                        19:59:01   Log-Likelihood:                 -31668.
No. Observations:                2579   AIC:                         6.334e+04
Df Residuals:                    2577   BIC:                         6.335e+04
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       1.377e+04   3283.652      4.195      0.000    7335.256    2.02e+04
GrLivArea    110.5551      2.099     52.665      0.000     106.439     114.671
==============================================================================
Omnibus:                      566.257   Durbin-Watson:                   1.926
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             3364.083
Skew:                           0.903   Prob(JB):                         0.00
Kurtosis:                       8.296   Cond. No.                     5.01e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 5.01e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
```
Note that this is not the same regression as the scikit-learn case, because here the full dataset is used without a train-test split.
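If you want a closer apples-to-apples comparison, you can also fit the statsmodels OLS model on the training split only; a quick sketch (the coefficients will differ slightly from the full-data summary above):

```python
# Fit OLS on the same training split used by scikit-learn, for comparison.
import statsmodels.api as sm

X_train_const = sm.add_constant(X_train)
model_stats_train = sm.OLS(y_train, X_train_const).fit()
print(model_stats_train.params)  # intercept (const) and GrLivArea slope from training data only
```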
Let’s dive into the statsmodels output for our OLS regression and explain what the p-values, coefficients, confidence intervals, and diagnostics tell us about our model, specifically focusing on predicting “SalePrice” from “GrLivArea”:
P-values and Coefficients
- Coefficient of “GrLivArea”: The coefficient for “GrLivArea” is 110.5551. This means that for every additional square foot of living area, the sales price of the house is expected to increase by approximately $110.55. This coefficient quantifies the impact of living area size on the house’s sales price.
- P-value for “GrLivArea”: The p-value associated with the “GrLivArea” coefficient is essentially 0 (indicated by P>|t| near 0.000), suggesting that the living area is a highly significant predictor of the sales price. In statistical terms, we can reject the null hypothesis that the coefficient is zero (no effect) and confidently state that there is a strong relationship between the living area and the sales price (though it is not necessarily the only factor).
Confidence Intervals
- Confidence Interval for “GrLivArea”: The confidence interval for the “GrLivArea” coefficient is [106.439, 114.671]. This range tells us that we can be 95% confident that the true impact of living area on sale price falls within this interval. It offers a measure of the precision of our coefficient estimate. (The short sketch after this list shows how to pull the coefficients, p-values, and confidence intervals directly from the fitted results object.)
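These quantities do not have to be read off the text summary; the fitted results object exposes them directly. A small sketch using the model_stats object fitted above:

```python
# Accessing coefficients, p-values, and confidence intervals programmatically.
print(model_stats.params)      # coefficients, including the intercept (const)
print(model_stats.pvalues)     # p-values for each coefficient
print(model_stats.conf_int())  # 95% confidence intervals by default
```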
Diagnostics
- R-squared (R²): The R² value of 0.518 indicates that the living area explains approximately 51.8% of the variability in sale prices. It is a measure of how well the model fits the data. This number is expected to differ from the R² reported by scikit-learn above, because here the model is fitted and evaluated on the entire dataset rather than scored on a held-out test set.
- F-statistic and Prob (F-statistic): The F-statistic is a measure of the overall significance of the model. With an F-statistic of 2774 and a Prob (F-statistic) essentially at 0, this indicates that the model is statistically significant.
- Omnibus, Prob(Omnibus): These tests assess the normality of the residuals. A residual is the difference between the actual value ($y$) and the predicted value ($\hat{y}$). The linear regression model assumes that the residuals are normally distributed. A Prob(Omnibus) value close to 0 suggests the residuals are not normally distributed, which could be a concern for the validity of some statistical tests.
- Durbin-Watson: The Durbin-Watson statistic tests for autocorrelation in the residuals. It ranges from 0 to 4, and a value close to 2 (here 1.926) suggests there is no strong autocorrelation. Values far from 2 indicate correlated residuals, which can be a sign of model misspecification, for example that the relationship between $X$ and $y$ is not purely linear. (A short sketch after this list shows how to recompute some of these diagnostics directly from the residuals.)
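Some of these diagnostics can also be recomputed directly from the residuals. Here is a small sketch using helpers from statsmodels and SciPy (scipy.stats.normaltest is the D'Agostino-Pearson test that underlies the Omnibus figure):

```python
# Recomputing two diagnostics from the residuals of the fitted model.
from statsmodels.stats.stattools import durbin_watson
from scipy import stats

residuals = model_stats.resid
print(f"Durbin-Watson: {durbin_watson(residuals):.3f}")  # close to 2 -> little autocorrelation

omnibus_stat, omnibus_p = stats.normaltest(residuals)    # tests normality of the residuals
print(f"Omnibus: {omnibus_stat:.3f}, Prob(Omnibus): {omnibus_p:.3f}")
```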
This comprehensive output from statsmodels provides a deep understanding of how and why “GrLivArea” influences “SalePrice,” backed by statistical evidence. It underscores the importance of not just using models for predictions but also interpreting them to make informed decisions based on a solid statistical foundation. This insight is invaluable for those looking to explore the statistical story behind their data.
Summary
In this post, we navigated through the foundational concepts of supervised learning, specifically focusing on regression analysis. Using the Ames Housing dataset, we demonstrated how to employ scikit-learn for model building and performance evaluation, and statsmodels for gaining statistical insights into our data. This journey from data to insights underscores the critical role of both predictive modeling and statistical analysis in understanding and leveraging data effectively.
Specifically, you learned:
- The distinction between classification and regression tasks in supervised learning.
- How to identify which approach to use based on the nature of your data.
- How to use scikit-learn to implement a simple linear regression model, assess its performance, and understand the significance of the model’s R² score.
- The value of employing statsmodels to explore the statistical aspects of your data, including the interpretation of coefficients, p-values, and confidence intervals, and the importance of diagnostic tests for model assumptions.
Do you have any questions? Please ask your questions in the comments below, and I will do my best to answer.