Linear regression is a simple approach to *supervised learning* and a useful tool for predicting a *quantitative response*. Though it may seem somewhat dull compared to some of the more modern statistical learning approaches, linear regression is still a widely used statistical learning method, and many fancier approaches can be seen as generalizations or extensions of it.

Simple linear regression is a very straightforward approach for predicting a quantitative response *Y* on the basis of a single predictor variable *X*. It assumes that there is some relationship between *X* and *Y*, which can be written in the very general form

*Y = f(X) + ε,*

where *f* is some fixed but unknown function of *X*, and *ε* is a random *error term* which is independent of *X* and has mean zero. Simple linear regression assumes this relationship is linear, so we can write it as

*Y = β₀ + β₁X + ε,*

where *β₀* and *β₁* are two unknown constants representing the *intercept* term (the expected value of *Y* when *X* = 0) and the *slope* term (the average increase in *Y* associated with a one-unit increase in *X* in the linear model), respectively. Together, *β₀* and *β₁* are known as the model *coefficients* or *parameters*.

Recall that *ε* is a catch-all for what we miss with this simple model: the true relationship is probably not linear, there may be other variables that cause variation in *Y*, and there may be measurement error.

In the example shown in **Figure 1**, the plot displays salary as a function of years of professional experience, where the red line represents a simple model that can be used to predict salary from years of experience.

The model given by the equation *Y = β₀ + β₁X + ε* defines the *population regression line*, which is the best linear approximation to the true relationship between *X* and *Y*. In real-world applications the population regression line is unobserved, as the true relationship is generally not known for real data. But even though we don't know the values of *β₀* and *β₁*, we do have access to a set of observations which we can use to estimate our model's coefficients.

Suppose *(x₁, y₁), (x₂, y₂), …, (xₙ, yₙ)* represent *n* observation pairs, each of which consists of a measurement of *X* and a measurement of *Y*. Our goal is to obtain coefficient estimates *β̂₀* and *β̂₁* such that the linear model fits the available data well, that is, so that *yᵢ ≈ β̂₀ + β̂₁xᵢ* for *i = 1, . . . , n*.

There are a number of ways of measuring *closeness*. However, by far the most common approach involves minimizing the *least squares* criterion.

Let *ŷᵢ = β̂₀ + β̂₁xᵢ* be the prediction for *Y* based on the *i*ᵗʰ value of *X*. Then *eᵢ = yᵢ − ŷᵢ* represents the *i*ᵗʰ *residual*, i.e. the difference between the *i*ᵗʰ observed response value and the *i*ᵗʰ response value that is predicted by our linear model.

Recall from **Figure 1** that the red line represents a model to predict salary using years of experience as its predictor. Figure 1 shows the *simple least squares fit* of salary to that variable.

So for each observation *(xᵢ, yᵢ)*, the absolute residual *rᵢ = |yᵢ − ŷᵢ|* quantifies the error, that is, the difference between the true value and the prediction, as shown in **Figure 2** below.

The *least squares* approach chooses *β̂₀* and *β̂₁* to minimize what's called the *Residual Sum of Squares* (RSS). The RSS is a measure of the discrepancy between the data and an estimation model, and it can be defined as

*RSS = e₁² + e₂² + ⋯ + eₙ²,*

or equivalently as

*RSS = (y₁ − β̂₀ − β̂₁x₁)² + (y₂ − β̂₀ − β̂₁x₂)² + ⋯ + (yₙ − β̂₀ − β̂₁xₙ)².*
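As a quick sanity check, the RSS can be computed directly from these definitions. The following is a minimal NumPy sketch using made-up numbers (not the salary data) and arbitrary candidate coefficients:

```python
import numpy as np

# Made-up sample data and candidate coefficients (for illustration only)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 4.9, 7.2, 8.8])
b0, b1 = 1.0, 2.0

y_hat = b0 + b1 * x           # predicted responses ŷᵢ
residuals = y - y_hat         # eᵢ = yᵢ − ŷᵢ
rss = np.sum(residuals ** 2)  # RSS = e₁² + e₂² + ⋯ + eₙ²
```

Different candidate coefficients give different RSS values; least squares picks the pair that makes this quantity as small as possible.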

Using some calculus, we can find the values of the minimizers:

*β̂₁ = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ (xᵢ − x̄)²,*

*β̂₀ = ȳ − β̂₁x̄,*

where *ȳ* and *x̄* are the sample means.

These two equations define the *least squares coefficient estimates* for simple linear regression, and the line defined by the equation

*ŷ = β̂₀ + β̂₁x*

is called the *regression line*.
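These closed-form estimates are easy to compute by hand; here is a short NumPy sketch with hypothetical observations (any small arrays would do):

```python
import numpy as np

# Hypothetical observations (not the salary data)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

x_bar, y_bar = x.mean(), y.mean()  # sample means x̄ and ȳ

# Least squares coefficient estimates
b1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
b0_hat = y_bar - b1_hat * x_bar

# Fitted values on the regression line ŷ = β̂₀ + β̂₁x
y_hat = b0_hat + b1_hat * x
```

Any library implementation of simple linear regression should recover exactly these two numbers, since the minimizers are unique whenever the *xᵢ* are not all equal.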

The process of determining the coefficient estimates is called **fitting**, or **training** the model. Once we have fitted the model, we can use it to predict the value of *y* at any given *x*.

The **scikit-learn library** in Python implements linear regression through the **LinearRegression class**. This class allows us to fit a linear model to a dataset, predict new values, and evaluate the model's performance.

To use the LinearRegression class, we first need to import it from the sklearn.linear_model module.

`from sklearn.linear_model import LinearRegression`

We can then use pandas and the method `read_csv` to read the .csv file containing the data and assign it to a data frame, which we will call `df`. Note that pandas must be imported first.

`import pandas as pd`

`df = pd.read_csv('Salary.csv')`

`df.head()`

We identify our predictor and our response variable and assign them to *X* and *y*, respectively.

`X = df[['Experience Years']].values`

`y = df['Salary'].values`

We instantiate the model using `LinearRegression` and call it `model`.

`model = LinearRegression()`

This statement creates the variable `model` as an instance of `LinearRegression`.

We then use that model to fit the data. Recall that here we calculate the optimal values of *β̂₀* and *β̂₁*, using the existing input and output *X* and *y* as arguments. In Python, `.fit()` fits the model and returns `self`, which is the variable `model` itself.

`model.fit(X, y)`

Once we have our model fitted, we can take a look at our coefficients and interpret them.

`model.intercept_`

>>> 25673.015760530274

`model.coef_`

>>> array([9523.65050742])

As we can see here, *β̂₀* = 25,673.01576 and *β̂₁* = 9,523.65051. According to this approximation, having no professional experience at all is associated with a salary of approximately $25,673, while an additional year of experience is associated with earning approximately $9,524 more.

We can now predict the value of *ŷ* at any given *x*, using the method `.predict()`.

`y_pred = model.predict(X)`

When applying this method, we pass the predictor as the argument and get the corresponding predicted response. Another nearly identical way to predict *Å·* would be

`y_pred = model.intercept_ + model.coef_ * X`

The output here differs from the previous line of code only in dimensions: The predicted response is now a two-dimensional array, while in the previous case, it had one dimension. If we reduce the number of dimensions of *X *to one, then these two approaches will yield the same result.
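The shape difference comes from NumPy broadcasting, and it can be illustrated without fitting anything; here `intercept` and `coef` are stand-in values playing the role of `model.intercept_` and `model.coef_`:

```python
import numpy as np

# Stand-in values mimicking a fitted model's attributes
intercept = 2.0
coef = np.array([3.0])        # shape (1,), like model.coef_

X = np.array([[1.0], [2.0]])  # 2-D predictor, shape (2, 1)

pred_2d = intercept + coef * X          # broadcasts to shape (2, 1)
pred_1d = intercept + coef * X.ravel()  # 1-D result, shape (2,)
```

Flattening the two-dimensional result with `.ravel()` recovers exactly the one-dimensional output.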

Now we can plot our predictions and visualize how well our model fits our data using the Seaborn and Matplotlib libraries as follows

`import seaborn as sns`

`import matplotlib.pyplot as plt`

`ax = sns.scatterplot(data=df[['Experience Years', 'Salary']],`

`                     x='Experience Years',`

`                     y='Salary',`

`                     alpha=0.8)`

`plt.plot(X, y_pred, color='red')`

`plt.legend(labels=["Data", "Model"])`

`plt.show()`

This yields the plot displayed as Figure 1.

The quality of a linear regression fit is typically assessed using two related quantities: the *residual standard error* (RSE) and the *R²* statistic.

Recall from the model *Y = β₀ + β₁X + ε* that associated with each observation is an error term *ε*. Due to the presence of these error terms, even if we knew the true regression line, i.e. even if *β₀* and *β₁* were known, we would not be able to perfectly predict *Y* from *X*. The RSE is an estimate of the standard deviation of *ε*. Roughly speaking, it is the average amount that the response will deviate from the true regression line. It is computed using the formula

*RSE = √(RSS / (n − 2)).*

The RSE provides an absolute measure of lack of fit of the model to the data, but since it is measured in the units of *Y*, it's not always clear what constitutes a good RSE.
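Given the residuals of a fitted model, the RSE is a one-liner; a sketch with hypothetical residual values:

```python
import numpy as np

# Hypothetical residuals eᵢ = yᵢ − ŷᵢ from some fitted model
residuals = np.array([0.5, -0.3, 0.2, -0.4, 0.1])
n = residuals.size

rss = np.sum(residuals ** 2)
rse = np.sqrt(rss / (n - 2))  # RSE = √(RSS / (n − 2))
```

The divisor *n − 2* reflects that two parameters, the intercept and the slope, have been estimated from the data.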

The *R²* statistic provides an alternative measure of fit. It takes the form of a *proportion* (the proportion of variance explained), and so it always takes on a value between 0 and 1, and is independent of the scale of *Y*. To calculate *R²*, we use the formula

*R² = (TSS − RSS) / TSS = 1 − RSS / TSS,*

where *TSS = Σ(yᵢ − ȳ)²* is the *Total Sum of Squares*, and RSS is the *Residual Sum of Squares* as seen earlier.

TSS measures the total variance in the response *Y*, and can be thought of as the amount of variability inherent in the response before the regression is performed. In contrast, RSS measures the amount of variability that is left unexplained after performing the regression. Hence, TSS − RSS measures the amount of variability in the response that is explained (or removed) by performing the regression, and *R²* measures the proportion of variability in *Y* that can be explained using *X*.
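This decomposition translates directly into code; a minimal sketch computing *R²* from TSS and RSS with made-up observed and predicted values:

```python
import numpy as np

# Made-up observed responses and model predictions
y = np.array([2.0, 4.0, 6.0, 8.0])
y_pred = np.array([2.2, 3.9, 6.1, 7.8])

tss = np.sum((y - y.mean()) ** 2)  # total variability in the response
rss = np.sum((y - y_pred) ** 2)    # variability left unexplained
r2 = (tss - rss) / tss             # equivalently: 1 - rss / tss
```

Because the predictions here track the observations closely, RSS is tiny relative to TSS and *R²* comes out near 1.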

An *R²* statistic that is close to 1 indicates that a large proportion of the variability in the response is explained by the regression. On the other hand, a number near 0 indicates that the regression does not explain much of the variability in the response; this might occur because the linear model is wrong, or the error variance *σ²* is high, or both.

In Python, we can compute *R²* by importing **r2_score** from the **sklearn.metrics** module.

`from sklearn.metrics import r2_score`

`r2 = r2_score(y, y_pred)`

`print(round(r2, 3))`

>>> 0.956

In our example, the variability in salary seems to be largely explained by a linear regression on professional experience.