Linear regression is a very simple approach for supervised learning and a useful tool for predicting a quantitative response. Though it may seem somewhat dull compared to some of the more modern statistical learning approaches, linear regression is still a useful and widely used statistical learning method, and many fancy statistical learning approaches can be seen as generalizations or extensions of linear regression.
Simple linear regression is a very straightforward approach for predicting a quantitative response Y on the basis of a single predictor variable X. It assumes that there is some relationship between X and Y, which can be written in the very general form
Y = f(X) + ε,
where f is some fixed but unknown function of X, and ε is a random error term which is independent of X and has mean 0. In simple linear regression we assume that f is linear in X, so mathematically we can write this relationship as
Y = β₀ + β₁X + ε,
where β₀ and β₁ are two unknown constants representing the intercept (the expected value of Y when X = 0) and the slope (the average increase in Y associated with a one-unit increase in X), respectively. Together, β₀ and β₁ are known as the model coefficients or parameters.
Recall that ε is a catch-all for what we miss with this simple model: the true relationship is probably not linear, there may be other variables that cause variation in Y, and there may be measurement error.
In the example shown in Figure 1, salary is plotted as a function of years of professional experience, and the red line represents a simple model that can be used to predict salary from years of experience.
While the model given by the equation Y = β₀ + β₁X + ε defines the population regression line, which is the best linear approximation to the true relationship between X and Y, in real-world applications the population regression line is unobserved, as the true relationship is generally not known for real data. But even though we don't know what the values of β₀ and β₁ are, we do have access to a set of observations which we can use to estimate our model's coefficients.
Suppose (x₁, y₁), (x₂, y₂), …, (xₙ, yₙ) represent n observation pairs, each of which consists of a measurement of X and a measurement of Y. Our goal is to obtain coefficient estimates β̂₀ and β̂₁ such that the linear model fits the available data well, that is, so that yᵢ ≈ β̂₀ + β̂₁xᵢ for i = 1, . . . , n.
There are a number of ways of measuring closeness. However, by far the most common approach involves minimizing the least squares criterion.
Let ŷᵢ = β̂₀ + β̂₁xᵢ be the prediction for Y based on the iᵗʰ value of X. Then eᵢ = yᵢ − ŷᵢ represents the iᵗʰ residual, i.e. the difference between the iᵗʰ observed response value and the iᵗʰ response value that is predicted by our linear model.
Recall from Figure 1 that the red line represents a model that predicts salary using years of experience as its predictor; it is the simple least squares fit of salary to that variable.
So for each observation (xᵢ, yᵢ), the absolute residual rᵢ = |yᵢ − ŷᵢ| quantifies the error, that is, the difference between the true value and the prediction, as shown in Figure 2 below.
The least squares approach chooses β̂₀ and β̂₁ to minimize what is called the Residual Sum of Squares (RSS). The RSS is a measure of the discrepancy between the data and an estimation model, and it can be defined as

RSS = e₁² + e₂² + ⋯ + eₙ²,

or equivalently as

RSS = (y₁ − β̂₀ − β̂₁x₁)² + (y₂ − β̂₀ − β̂₁x₂)² + ⋯ + (yₙ − β̂₀ − β̂₁xₙ)².
Using some calculus, we can find the values of the minimizers:

β̂₁ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²,  β̂₀ = ȳ − β̂₁x̄,

where ȳ and x̄ are the sample means of the response and the predictor.
These two equations define the least squares coefficient estimates for simple linear regression, and the line defined by the equation

ŷ = β̂₀ + β̂₁x

is called the regression line.
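To make these formulas concrete, here is a minimal NumPy sketch that computes β̂₁, β̂₀, the fitted values, and the RSS directly from the expressions above; the arrays x and y below are small made-up placeholders, not the salary data used later.

import numpy as np

# hypothetical predictor and response values (placeholders, not the real dataset)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([35000.0, 45000.0, 52000.0, 63000.0, 74000.0])

x_bar, y_bar = x.mean(), y.mean()

# slope and intercept from the closed-form least squares formulas
beta_1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta_0 = y_bar - beta_1 * x_bar

# fitted values, residuals, and the residual sum of squares
y_hat = beta_0 + beta_1 * x
residuals = y - y_hat
rss = np.sum(residuals ** 2)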
The process of determining the coefficient estimates is called fitting, or training, the model. Once we have fitted the model, we can use it to predict the value of y at any given x.
The scikit-learn library in Python implements linear regression through the LinearRegression class. This class allows us to fit a linear model to a dataset, predict new values, and evaluate the model's performance.
To use the LinearRegression class, we first need to import it from the sklearn.linear_model module.
from sklearn.linear_model import LinearRegression
We can then use pandas and its read_csv function to read the .csv file containing the data and assign it to a data frame, which we will call 'df' here.

import pandas as pd

df = pd.read_csv('Salary.csv')
df.head()
We identify our predictor and our response variable and assign them to X and y, respectively.
X = df[['Experience Years']].values
y = df['Salary'].values
We instantiate the model using the LinearRegression class and call the resulting object 'model'.
model = LinearRegression()
This statement creates the variable model as an instance of LinearRegression.
We then fit the model to the data. Recall that this is where we calculate the optimal values of β̂₀ and β̂₁, using the existing input and output X and y as arguments. In Python, .fit() fits the model and returns self, which is the variable model itself.
model.fit(X, y)
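Since .fit() returns the fitted estimator itself, instantiation and fitting can also be chained into a single statement:

model = LinearRegression().fit(X, y)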
Once we have our model fitted, we can take a look at our coefficients and interpret them.
model.intercept_
>>> 25673.015760530274

model.coef_
>>> array([9523.65050742])
As we can see here, β̂₀ = 25,673.01576 and β̂₁ = 9,523.65051. According to this approximation, having no professional experience at all is associated with a salary of approximately $25,673, while an additional year of experience is associated with earning approximately $9,524 more.
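For example, plugging x = 5 into the fitted equation gives ŷ = 25,673.02 + 9,523.65 × 5 ≈ 73,291, i.e. a predicted salary of roughly $73,291 for someone with five years of experience.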
We can now predict the value of ŷ at any given x, using the method .predict().
y_pred = model.predict(X)
When applying this method, we pass the predictor as the argument and get the corresponding predicted response. Another nearly identical way to predict ŷ would be
y_pred = model.intercept_ + model.coef_ * X
The output here differs from the previous line of code only in dimensions: The predicted response is now a two-dimensional array, while in the previous case, it had one dimension. If we reduce the number of dimensions of X to one, then these two approaches will yield the same result.
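To make the equivalence concrete, here is a small sketch (assuming NumPy has been imported as np) that flattens X and checks that the two approaches produce the same numbers:

import numpy as np

# manual prediction with a one-dimensional predictor
y_pred_manual = model.intercept_ + model.coef_ * X.ravel()

# the two prediction methods agree element-wise
print(np.allclose(model.predict(X), y_pred_manual))  # should print True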
Now we can plot our predictions and visualize how well our model fits our data using the Seaborn and Matplotlib libraries as follows:

import seaborn as sns
import matplotlib.pyplot as plt

ax = sns.scatterplot(data=df[['Experience Years', 'Salary']],
                     x='Experience Years',
                     y='Salary',
                     alpha=0.8)
plt.plot(X, y_pred, color='red')
plt.legend(labels=["Data", "Model"])
plt.show()
This yields the plot displayed as Figure 1.
The quality of a linear regression fit is typically assessed using two related quantities: the residual standard error (RSE) and the R² statistic.
Recall from the model Y = β₀ + β₁X + ε that associated with each observation is an error term ε. Due to the presence of these error terms, even if we knew the true regression line, i.e. even if β₀ and β₁ were known, we would not be able to perfectly predict Y from X. The RSE is an estimate of the standard deviation of ε. Roughly speaking, it is the average amount that the response will deviate from the true regression line. It is computed using the formula

RSE = √(RSS / (n − 2)).
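As a quick sketch, the RSE can be computed directly from the residuals of our fitted model (reusing the fitted model, X, and y from above, and assuming NumPy is imported as np):

import numpy as np

y_pred = model.predict(X)           # 1-D predictions from the fitted model
rss = np.sum((y - y_pred) ** 2)     # residual sum of squares
rse = np.sqrt(rss / (len(y) - 2))   # residual standard error
print(round(rse, 2))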
The RSE provides an absolute measure of lack of fit of the model to the data, but since it is measured in the units of Y, it's not always clear what constitutes a good RSE.
The R² statistic provides an alternative measure of fit. It takes the form of a proportion, the proportion of variance explained, and so it always takes on a value between 0 and 1, and is independent of the scale of Y. To calculate R², we use the formula

R² = (TSS − RSS) / TSS = 1 − RSS / TSS,

where TSS = Σ(yᵢ − ȳ)² is the Total Sum of Squares, and RSS is the Residual Sum of Squares as seen earlier.
TSS measures the total variance in the response Y, and can be thought of as the amount of variability inherent in the response before the regression is performed. In contrast, RSS measures the amount of variability that is left unexplained after performing the regression. Hence, TSS − RSS measures the amount of variability in the response that is explained (or removed) by performing the regression, and R² measures the proportion of variability in Y that can be explained using X.
An R² statistic that is close to 1 indicates that a large proportion of the variability in the response is explained by the regression. On the other hand, a value near 0 indicates that the regression does not explain much of the variability in the response; this might occur because the linear model is wrong, or the error variance σ² is high, or both.
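Under the same assumptions as before (the fitted model, X, and y from above, NumPy imported as np), a minimal sketch of this decomposition looks like:

import numpy as np

y_pred = model.predict(X)           # predictions from the fitted model
tss = np.sum((y - y.mean()) ** 2)   # total sum of squares
rss = np.sum((y - y_pred) ** 2)     # residual sum of squares
r_squared = 1 - rss / tss           # proportion of variance explained
print(round(r_squared, 3))

The value printed here should match the one returned by scikit-learn's r2_score below.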
In Python, we can compute R² by importing r2_score from the sklearn.metrics module.
from sklearn.metrics import r2_score
r2 = r2_score(y, y_pred)
print(round(r2, 3))
>>> 0.956
In our example, the variability in salary seems to be largely explained by a linear regression on professional experience.