Linear regression is a very simple approach for supervised learning and a useful tool for predicting a quantitative response. Though it may seem somewhat dull compared to some of the more modern statistical learning approaches, linear regression is still a useful and widely used statistical learning method, and many fancy statistical learning approaches can be seen as generalizations or extensions of linear regression.
Simple linear regression is a very straightforward approach for predicting a quantitative response Y on the basis of a single predictor variable X. It assumes that there is some relationship between X and Y, which can be written in the very general form
Y = f(X) + ε,
where f is some fixed but unknown function of X, and ε is a random error term which is independent of X and has mean 0. In simple linear regression we assume that f is linear in X, so mathematically we can write this relationship as
Y = β₀ + β₁X + ε,
where β₀ and β₁ are two unknown constants representing the intercept (the expected value of Y when X = 0) and the slope (the average increase in Y associated with a one-unit increase in X), respectively. Together, β₀ and β₁ are known as the model coefficients or parameters.
Recall that ε is a catch-all for what we miss with this simple model: the true relationship is probably not linear, there may be other variables that cause variation in Y, and there may be measurement error.
In the example shown in Figure 1, salary is plotted as a function of years of professional experience, and the red line represents a simple model that can be used to predict salary from years of experience.
While the model given by the equation Y = β₀ + β₁X + ε defines the population regression line, which is the best linear approximation to the true relationship between X and Y, in real-world applications the population regression line is unobserved, as the true relationship is generally not known for real data. But even though we don't know what the values of β₀ and β₁ are, we do have access to a set of observations which we can use to estimate our model's coefficients.
Suppose (x₁, y₁), (x₂, y₂), …, (xₙ, yₙ) represent n observation pairs, each of which consists of a measurement of X and a measurement of Y. Our goal is to obtain coefficient estimates β̂₀ and β̂₁ such that the linear model fits the available data well, that is, so that yᵢ ≈ β̂₀ + β̂₁xᵢ for i = 1, . . . , n.
There are a number of ways of measuring closeness. However, by far the most common approach involves minimizing the least squares criterion.
Let ŷᵢ = β̂₀ + β̂₁xᵢ be the prediction for Y based on the iᵗʰ value of X. Then eᵢ = yᵢ − ŷᵢ represents the iᵗʰ residual, i.e. the difference between the iᵗʰ observed response value and the iᵗʰ response value that is predicted by our linear model.
Recall from Figure 1 that the red line represents a model that predicts salary using years of experience as its predictor; it is the simple least squares fit of salary to that variable.
So for each observation (xᵢ, yᵢ), the absolute residual rᵢ = |yᵢ − ŷᵢ| quantifies the error, that is, the difference between the true value and the prediction, as shown in Figure 2 below.
The least squares approach chooses β̂₀ and β̂₁ to minimize what is called the Residual Sum of Squares (RSS). The RSS is a measure of the discrepancy between the data and an estimation model, and it can be defined as

RSS = e₁² + e₂² + ⋯ + eₙ²,

or equivalently as

RSS = (y₁ − β̂₀ − β̂₁x₁)² + (y₂ − β̂₀ − β̂₁x₂)² + ⋯ + (yₙ − β̂₀ − β̂₁xₙ)².
Using some calculus, we can find the values of the minimizers:

β̂₁ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²,  β̂₀ = ȳ − β̂₁x̄,

where ȳ and x̄ are the sample means of the response and the predictor.
These two equations define the least squares coefficient estimates for simple linear regression, and the line defined by the equation

ŷ = β̂₀ + β̂₁x

is called the regression line.
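To make these formulas concrete, here is a minimal NumPy sketch that computes β̂₁, β̂₀, the fitted values, and the RSS directly from the expressions above; the arrays x and y below are small made-up placeholders, not the salary data used later.

import numpy as np

# hypothetical predictor and response values (placeholders, not the real dataset)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([35000.0, 45000.0, 52000.0, 63000.0, 74000.0])

x_bar, y_bar = x.mean(), y.mean()

# slope and intercept from the closed-form least squares formulas
beta_1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta_0 = y_bar - beta_1 * x_bar

# fitted values, residuals, and the residual sum of squares
y_hat = beta_0 + beta_1 * x
residuals = y - y_hat
rss = np.sum(residuals ** 2)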
The process of determining the coefficient estimates is called fitting, or training, the model. Once we have fitted the model, we can use it to predict the value of y at any given x.
The scikit-learn library in Python implements linear regression through the LinearRegression class. This class allows us to fit a linear model to a dataset, predict new values, and evaluate the model's performance.
To use the LinearRegression class, we first need to import it from the sklearn.linear_model module.
from sklearn.linear_model import LinearRegression
We can then use pandas and its read_csv function to read the .csv file containing the data and assign it to a data frame, which we will call 'df' here.

import pandas as pd

df = pd.read_csv('Salary.csv')
df.head()
We identify our predictor and our response variable and assign them to X and y, respectively.
X = df[['Experience Years']].values
y = df['Salary'].values
We instantiate the model using the LinearRegression class and call the resulting object 'model'.
model = LinearRegression()
This statement creates the variable model as an instance of LinearRegression.
We then fit the model to the data. Recall that this is where we calculate the optimal values of β̂₀ and β̂₁, using the existing input and output X and y as arguments. In Python, .fit() fits the model and returns self, which is the variable model itself.
model.fit(X, y)
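Since .fit() returns the fitted estimator itself, instantiation and fitting can also be chained into a single statement:

model = LinearRegression().fit(X, y)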
Once we have our model fitted, we can take a look at our coefficients and interpret them.
model.intercept_
>>> 25673.015760530274

model.coef_
>>> array([9523.65050742])
As we can see here, β̂₀ = 25,673.01576 and β̂₁ = 9,523.65051. According to this approximation, having no professional experience at all is associated with a salary of approximately $25,673, while an additional year of experience is associated with earning approximately $9,524 more.
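For example, plugging x = 5 into the fitted equation gives ŷ = 25,673.02 + 9,523.65 × 5 ≈ 73,291, i.e. a predicted salary of roughly $73,291 for someone with five years of experience.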
We can now predict the value of ŷ at any given x, using the method .predict().
y_pred = model.predict(X)
When applying this method, we pass the predictor as the argument and get the corresponding predicted response. Another nearly identical way to predict ŷ would be
y_pred = model.intercept_ + model.coef_ * X
The output here differs from the previous line of code only in dimensions: The predicted response is now a two-dimensional array, while in the previous case, it had one dimension. If we reduce the number of dimensions of X to one, then these two approaches will yield the same result.
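To make the equivalence concrete, here is a small sketch (assuming NumPy has been imported as np) that flattens X and checks that the two approaches produce the same numbers:

import numpy as np

# manual prediction with a one-dimensional predictor
y_pred_manual = model.intercept_ + model.coef_ * X.ravel()

# the two prediction methods agree element-wise
print(np.allclose(model.predict(X), y_pred_manual))  # should print True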
Now we can plot our predictions and visualize how well our model fits our data using the Seaborn and Matplotlib libraries as follows:

import seaborn as sns
import matplotlib.pyplot as plt

ax = sns.scatterplot(data=df[['Experience Years', 'Salary']],
                     x='Experience Years',
                     y='Salary',
                     alpha=0.8)
plt.plot(X, y_pred, color='red')
plt.legend(labels=["Data", "Model"])
plt.show()
This yields the plot displayed as Figure 1.
The quality of a linear regression fit is typically assessed using two related quantities: the residual standard error (RSE) and the R² statistic.
Recall from the model Y = β₀ + β₁X + ε that associated with each observation is an error term ε. Due to the presence of these error terms, even if we knew the true regression line, i.e. even if β₀ and β₁ were known, we would not be able to perfectly predict Y from X. The RSE is an estimate of the standard deviation of ε. Roughly speaking, it is the average amount that the response will deviate from the true regression line. It is computed using the formula

RSE = √(RSS / (n − 2)).
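As a quick sketch, the RSE can be computed directly from the residuals of our fitted model (reusing the fitted model, X, and y from above, and assuming NumPy is imported as np):

import numpy as np

y_pred = model.predict(X)           # 1-D predictions from the fitted model
rss = np.sum((y - y_pred) ** 2)     # residual sum of squares
rse = np.sqrt(rss / (len(y) - 2))   # residual standard error
print(round(rse, 2))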
The RSE provides an absolute measure of lack of fit of the model to the data, but since it is measured in the units of Y, it's not always clear what constitutes a good RSE.
The R² statistic provides an alternative measure of fit. It takes the form of a proportion, the proportion of variance explained, and so it always takes on a value between 0 and 1, and is independent of the scale of Y. To calculate R², we use the formula

R² = (TSS − RSS) / TSS = 1 − RSS / TSS,

where TSS = Σ(yᵢ − ȳ)² is the Total Sum of Squares, and RSS is the Residual Sum of Squares as seen earlier.
TSS measures the total variance in the response Y, and can be thought of as the amount of variability inherent in the response before the regression is performed. In contrast, RSS measures the amount of variability that is left unexplained after performing the regression. Hence, TSS − RSS measures the amount of variability in the response that is explained (or removed) by performing the regression, and R² measures the proportion of variability in Y that can be explained using X.
An R² statistic that is close to 1 indicates that a large proportion of the variability in the response is explained by the regression. On the other hand, a value near 0 indicates that the regression does not explain much of the variability in the response; this might occur because the linear model is wrong, or the error variance σ² is high, or both.
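Under the same assumptions as before (the fitted model, X, and y from above, NumPy imported as np), a minimal sketch of this decomposition looks like:

import numpy as np

y_pred = model.predict(X)           # predictions from the fitted model
tss = np.sum((y - y.mean()) ** 2)   # total sum of squares
rss = np.sum((y - y_pred) ** 2)     # residual sum of squares
r_squared = 1 - rss / tss           # proportion of variance explained
print(round(r_squared, 3))

The value printed here should match the one returned by scikit-learn's r2_score below.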
In Python, we can compute R² by importing r2_score from the sklearn.metrics module.
from sklearn.metrics import r2_score
r2 = r2_score(y, y_pred)
print(round(r2, 3))
>>> 0.956
In our example, the variability in salary seems to be largely explained by a linear regression on professional experience.