Understanding Statistical Learning Part 1 — A Comprehensive Guide to Simple Linear Regression in Python


Linear regression is a very simple approach for supervised learning and a useful tool for predicting a quantitative response. Though it may seem somewhat dull compared to some of the more modern statistical learning approaches, linear regression is still a useful and widely used statistical learning method, and many fancy statistical learning approaches can be seen as generalizations or extensions of linear regression.

Simple linear regression is a very straightforward approach for predicting a quantitative response Y on the basis of a single predictor variable X. It assumes that there is some relationship between X and Y, which can be written in the very general form

Y = f(X) + ε,

where f is some fixed but unknown function of X, and ε is a random error term which is independent of X and has mean zero. Simple linear regression further assumes that this relationship is approximately linear, so we can write it as

Y = β₀ + β₁X + ε,

where β₀ and β₁ are two unknown constants that represent the intercept term — that is, the expected value of Y when X = 0, and the slope term — the average increase in Y associated with a one-unit increase in X in the linear model, respectively. Together, β₀ and β₁ are known as the model coefficients or parameters.

Recall that ε is a catch-all for what we miss with this simple model: the true relationship is probably not linear, there may be other variables that cause variation in Y, and there may be measurement error.
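To make the roles of β₀, β₁ and ε concrete, here is a minimal NumPy sketch that simulates data from such a model; the coefficient values, noise level and variable names are made up purely for illustration.

import numpy as np

rng = np.random.default_rng(0)

beta_0, beta_1 = 25_000, 9_500            # illustrative intercept and slope
x_sim = rng.uniform(0, 10, size=100)      # e.g. years of experience
eps = rng.normal(0, 5_000, size=100)      # error term: mean 0, independent of x_sim
y_sim = beta_0 + beta_1 * x_sim + eps     # Y = β₀ + β₁X + ε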

Figure 1 – A Simple Linear Regression Model by Author

In the example above, the plot displays salary as a function of years of professional experience, where the red line represents a simple model that can be used to predict salary using the years of experience.

While the model given by the equation Y = β₀ + β₁X + ε defines the population regression line, which is the best linear approximation to the true relationship between X and Y, in real-world applications the population regression line is unobserved, as the true relationship is generally not known for real data. But even though we don’t know what the values of β₀ and β₁ are, we do have access to a set of observations which we can use to estimate our model’s coefficients.

Suppose (x₁, y₁), (x₂, y₂), …, (xₙ, yₙ) represent n observation pairs, each of which consists of a measurement of X and a measurement of Y. Our goal is to obtain coefficient estimates βˆ₀ and βˆ₁ such that the linear model fits the available data well — that is, so that yᵢ ≈ βˆ₀ + βˆ₁xᵢ for i = 1, . . . , n.

There are a number of ways of measuring closeness. However, by far the most common approach involves minimizing the least squares criterion.

Imagine ŷᵢ = βˆ₀ + βˆ₁xᵢ being the prediction for Y based on the i-th value of X. Then eᵢ = yᵢ − ŷᵢ represents the i-th residual, i.e. the difference between the i-th observed response value and the i-th response value that is predicted by our linear model.
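As a tiny worked example with hypothetical numbers, the residuals are simply the observed values minus the predicted ones:

import numpy as np

y_obs = np.array([39_000, 63_000, 93_000])   # hypothetical observed salaries
y_hat = np.array([41_000, 60_000, 95_000])   # hypothetical model predictions
residuals = y_obs - y_hat                    # eᵢ = yᵢ − ŷᵢ
print(residuals)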

Recall from Figure 1 that the red line represents a model to predict salary using years of experience as its predictor. Figure 1 shows the simple least squares fit of salary to that variable.

So for each observation (xᵢ, yᵢ), the absolute residual |eᵢ| = |yᵢ − ŷᵢ| quantifies the error — that is, the difference between the true value and the prediction, as shown in Figure 2 below.

Figure 2 — The Least Squares Fit For The Regression of Salary Onto Years of Experience by Author

The least squares approach chooses βˆ₀ and βˆ₁ to minimize what’s called the Residual Sum of Squares (RSS). The RSS is a measure of the discrepancy between the data and an estimation model, and it can be defined as

RSS = e₁² + e₂² + … + eₙ²,

or equivalently as

RSS = (y₁ − βˆ₀ − βˆ₁x₁)² + (y₂ − βˆ₀ − βˆ₁x₂)² + … + (yₙ − βˆ₀ − βˆ₁xₙ)².

Using some calculus, we can find the values of the minimizers:

βˆ₁ = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ (xᵢ − x̄)²,

βˆ₀ = ȳ − βˆ₁x̄,

where ȳ = (1/n) Σᵢ yᵢ and x̄ = (1/n) Σᵢ xᵢ are the sample means.

These two equations define the least squares coefficient estimates for simple linear regression, and the line defined by the equation

ŷ = βˆ₀ + βˆ₁x

is called the regression line.
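These formulas translate directly into code. The sketch below implements the closed-form estimates with NumPy (the function name is hypothetical); in the next section we will let scikit-learn do this work for us.

import numpy as np

def least_squares_fit(x, y):
    # closed-form least squares estimates for simple linear regression
    x_bar, y_bar = x.mean(), y.mean()
    beta_1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    beta_0_hat = y_bar - beta_1_hat * x_bar
    return beta_0_hat, beta_1_hat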

The process of determining the coefficient estimates is called fitting, or training, the model. Once the model has been fitted, we can use it to predict the value of y at any given x.

The scikit-learn library in Python implements linear regression through the LinearRegression class. This class allows us to fit a linear model to a dataset, predict new values, and evaluate the model’s performance.

To use the LinearRegression class, we first need to import it from the sklearn.linear_model module.

from sklearn.linear_model import LinearRegression

We can then use pandas and its read_csv function to read the .csv file containing the data and assign it to a DataFrame, which we will call ‘df’ here.

import pandas as pd

df = pd.read_csv('Salary.csv')
df.head()

We identify our predictor and our response variable and assign them to X and y, respectively.

X = df[['Experience Years']].values   # double brackets keep X two-dimensional, as scikit-learn expects
y = df['Salary'].values               # single brackets give a one-dimensional array

We instantiate the LinearRegression class and call the resulting object ‘model’.

model = LinearRegression()

This statement creates the variable model as an instance of LinearRegression.

We then use that model to fit the data, i.e. we calculate the optimal values of βˆ₀ and βˆ₁ by passing the existing input and output, X and y, as arguments. In scikit-learn, .fit() fits the model and returns self, which is the variable model itself.

model.fit(X, y)

Once we have our model fitted, we can take a look at our coefficients and interpret them.

model.intercept_
>>> 25673.015760530274

model.coef_
>>> array([9523.65050742])

As we can see here, βˆ₀ = 25,673.01576 and βˆ₁ = 9,523.65051. According to this approximation, having no professional experience at all is associated with a salary of approximately $25,673, while an additional year of experience is associated with earning approximately $9,524 more.
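As a quick sanity check, a candidate with, say, five years of experience would be predicted to earn roughly $25,673 + 5 × $9,524 ≈ $73,291, which we can confirm from the fitted coefficients:

print(model.intercept_ + model.coef_[0] * 5)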

We can now predict the value of ŷ at any given x, using the method .predict().

y_pred = model.predict(X)

When applying this method, we pass the predictor as the argument and get the corresponding predicted response. Another nearly identical way to predict ŷ would be

y_pred = model.intercept_ + model.coef_ * X

The output here differs from the previous line of code only in shape: because X is a two-dimensional array with one column, model.coef_ * X is also two-dimensional, whereas model.predict(X) returns a one-dimensional array. If we flatten X to one dimension first, the two approaches yield the same result.
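A small sketch of that shape difference, reusing the names defined above:

import numpy as np

print(model.predict(X).shape)                      # (n,): one dimension
print((model.intercept_ + model.coef_ * X).shape)  # (n, 1): two dimensions

y_pred_manual = model.intercept_ + model.coef_ * X.ravel()   # flatten X first
print(np.allclose(y_pred_manual, model.predict(X)))          # True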

Now we can plot our predictions and visualize how well our model fits our data using the Seaborn and Matplotlib libraries as follows

import seaborn as sns
import matplotlib.pyplot as plt

ax = sns.scatterplot(data=df[['Experience Years', 'Salary']],
                     x='Experience Years',
                     y='Salary',
                     alpha=0.8)

plt.plot(X, y_pred, color='red')
plt.legend(labels=["Data", "Model"])
plt.show()

This yields the plot displayed as Figure 1.

The quality of a linear regression fit is typically assessed using two related quantities: the residual standard error (RSE) and the R² statistic.

Recall from the model Y = β₀ + β₁X + ε that associated with each observation is an error term ε. Due to the presence of these error terms, even if we knew the true regression line, i.e. even if β₀ and β₁ were known, we would not be able to perfectly predict Y from X. The RSE is an estimate of the standard deviation of ε. Roughly speaking, it is the average amount that the response will deviate from the true regression line. It is computed using the formula

RSE = √(RSS / (n − 2)).

The RSE provides an absolute measure of lack of fit of the model to the data, but since it is measured in the units of Y, it’s not always clear what constitutes a good RSE.
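The RSE is straightforward to compute by hand from the residuals; here is a minimal NumPy sketch reusing y and y_pred from above:

import numpy as np

rss = np.sum((y - y_pred) ** 2)      # residual sum of squares
rse = np.sqrt(rss / (len(y) - 2))    # residual standard error
print(round(rse, 2))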

The R² statistic provides an alternative measure of fit. It takes the form of a proportion—the proportion of variance explained—and so it always takes on a value between 0 and 1, and is independent of the scale of Y. To calculate R², we use the formula

R² = (TSS − RSS) / TSS = 1 − RSS / TSS,

where TSS = Σ(yᵢ − ȳ)² is the Total Sum of Squares, and RSS is the Residual Sum of Squares as seen earlier.

TSS measures the total variance in the response Y, and can be thought of as the amount of variability inherent in the response before the regression is performed. In contrast, RSS measures the amount of variability that is left unexplained after performing the regression. Hence, TSS − RSS measures the amount of variability in the response that is explained (or removed) by performing the regression, and R² measures the proportion of variability in Y that can be explained using X.

An R² statistic that is close to 1 indicates that a large proportion of the variability in the response is explained by the regression. On the other hand a number near 0 indicates that the regression does not explain much of the variability in the response; this might occur because the linear model is wrong, or the error variance σ² is high, or both.
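Before turning to scikit-learn’s helper below, R² can also be computed by hand from the TSS and RSS, again reusing y and y_pred:

import numpy as np

tss = np.sum((y - y.mean()) ** 2)    # total sum of squares
rss = np.sum((y - y_pred) ** 2)      # residual sum of squares
print(round(1 - rss / tss, 3))       # matches r2_score below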

In Python, we can compute R² by importing r2_score from the sklearn.metrics module.

from sklearn.metrics import r2_score
r2 = r2_score(y, y_pred)
print(round(r2, 3))
>>> 0.956

In our example, R² ≈ 0.956, so roughly 96% of the variability in salary is explained by a linear regression on years of professional experience.
