Over the past few years, population growth, technological advancement, and fossil fuel use have all increased significantly, and each has contributed to deteriorating air quality as measured by the Air Quality Index (AQI). Densely populated urban areas, such as New York City and Los Angeles, are more susceptible to health problems caused by poor air quality. This article uses Python and machine learning to build and apply a linear regression model that forecasts AQI from the five major pollutants identified by the Environmental Protection Agency (EPA): ground-level ozone, particle pollution (PM2.5 and PM10), carbon monoxide, sulfur dioxide, and nitrogen dioxide. A model of this kind could help government authorities and scientists plan air purification strategies that maintain a target Air Quality Index.
We’ll begin by accessing a fairly complete dataset from Kaggle. The dataset contains the following features:
· PM2.5: The concentration of fine particulate matter with a diameter of less than 2.5 micrometers (µg/m³).
· PM10: The concentration of particulate matter with a diameter of less than 10 micrometers (µg/m³).
· NO2: The concentration of nitrogen dioxide (µg/m³).
· SO2: The concentration of sulfur dioxide (µg/m³).
· CO: The concentration of carbon monoxide (mg/m³).
· O3: The concentration of ozone (µg/m³).
· Temperature: The temperature at the time of measurement (°C).
· Humidity: The humidity level at the time of measurement (%).
· Wind Speed: The wind speed at the time of measurement (m/s).
As we can see, the dataset is fairly complete, except that AQI itself has not been calculated.
Before we jump into programming, note that this article assumes readers are familiar with Python and the terminology used from this point on. Let’s start by importing the libraries needed to build a linear regression model.
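The import cell itself is not shown here. A plausible set, inferred from the functions called later in this article (the matplotlib import is an assumption based on the plot discussed further down), might look like this:

```python
# Assumed imports, inferred from the calls used later in the article.
import numpy as np                      # np.sqrt for RMSE
import pandas as pd                     # data frame handling
import matplotlib.pyplot as plt         # assumed: used for the actual vs. predicted plot
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
```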
Once the libraries are imported, we’ll load our CSV into a pandas data frame and check for NaN values by chaining the isna() and sum() methods: df.isna().sum(). As the output shows, our dataset is free of NaN values. It is critical to assess and resolve NaN values before training any machine learning model, so that the resulting model is accurate.
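As a minimal sketch of this check, the snippet below reads a tiny in-memory CSV instead of the Kaggle file. The rows are made up for illustration, and one value is left blank on purpose so the count is not all zeros:

```python
import io
import pandas as pd

# Stand-in for the Kaggle CSV; these rows are invented for illustration.
csv_text = """PM2.5,PM10,NO2,SO2,CO,O3
12.0,30.5,20.1,5.2,0.4,60.3
35.4,,41.0,8.8,0.9,80.1
"""

# With a real file this would be pd.read_csv("<path to your CSV>").
df = pd.read_csv(io.StringIO(csv_text))

# isna() flags each missing cell; sum() then counts the flags per column.
nan_counts = df.isna().sum()
print(nan_counts)
```

A column with a non-zero count needs attention (for example, dropping or imputing the affected rows) before the data is passed to a model.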
As mentioned, the dataset does not contain our target, AQI, so we must calculate it ourselves using Python functions together with the pollutant breakpoints and the AQI formula provided by the EPA. Once all breakpoints have been defined, we’ll calculate the AQI for each pollutant using a lambda function, append those outputs to our data frame, and finally calculate and append the overall AQI.
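The calculation can be sketched as follows for a single pollutant. This is an illustration under stated assumptions, not the article's exact code: the function names are hypothetical, and the PM2.5 breakpoints follow the pre-2024 EPA table, so they should be checked against current EPA guidance. The EPA also truncates concentrations to the table's precision before the lookup, which this sketch omits.

```python
import math

# Assumed PM2.5 breakpoints (24-hour, µg/m³) from the pre-2024 EPA table,
# stored as (C_low, C_high, I_low, I_high). Verify before real use.
PM25_BREAKPOINTS = [
    (0.0, 12.0, 0, 50),
    (12.1, 35.4, 51, 100),
    (35.5, 55.4, 101, 150),
    (55.5, 150.4, 151, 200),
    (150.5, 250.4, 201, 300),
    (250.5, 350.4, 301, 400),
    (350.5, 500.4, 401, 500),
]


def sub_index(conc, breakpoints):
    """Interpolate linearly inside the bracket that contains conc.

    I = (I_high - I_low) / (C_high - C_low) * (conc - C_low) + I_low
    Returns NaN when conc falls outside every bracket.
    """
    for c_low, c_high, i_low, i_high in breakpoints:
        if c_low <= conc <= c_high:
            return (i_high - i_low) / (c_high - c_low) * (conc - c_low) + i_low
    return math.nan


def overall_aqi(sub_indices):
    """Overall AQI as the highest available sub-index (NaNs are skipped)."""
    valid = [v for v in sub_indices if not math.isnan(v)]
    return max(valid) if valid else math.nan
```

On a data frame, a per-pollutant column could then be built with something like df["PM2.5"].apply(lambda c: sub_index(c, PM25_BREAKPOINTS)), with one breakpoint table per pollutant. Concentrations that fall outside every bracket come back as NaN in this sketch.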
It is good practice to call head() after computations like these to confirm that the results look as expected. We can see that the function works as intended, although we have introduced a few NaN values in the process. For our use case this isn’t an issue, because those columns are dropped in the next phase of building our model.
Continuing with preprocessing, we’ll create a list of the indices of the columns we intend to drop, then use df.drop together with .copy() to create a trimmed copy of our data frame. Working on a copy leaves the original data frame intact in case we want to analyze it later. Our data frame now has the following features:
· Country
· PM2.5
· PM10
· NO2
· SO2
· CO
· O3
· Temperature
· Humidity
· Wind Speed
· AQI
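The drop-and-copy step described above can be sketched on a toy data frame. The column names "City" and "PM2.5_AQI" and the index list are hypothetical, since the article's actual indices are not shown:

```python
import pandas as pd

# Toy frame; the extra columns stand in for whatever the real data set carries.
df = pd.DataFrame({
    "City": ["A", "B"],
    "PM2.5": [10.0, 40.0],
    "PM2.5_AQI": [41.7, 112.1],
    "AQI": [41.7, 112.1],
})

# Positions of the columns to drop (hypothetical values).
drop_idx = [0, 2]

# df.columns[drop_idx] turns positions into labels for drop();
# .copy() makes df_copy an independent object, so later edits never touch df.
df_copy = df.drop(columns=df.columns[drop_idx]).copy()

print(df_copy.columns.tolist())
```

Dropping by position is brittle if the column order ever changes, so passing the column names directly to drop() is a reasonable alternative.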
To create a linear regression model, our dataset must be split into features and a target. In our use case the target is AQI, the value we want to predict, and the features (variables) are the pollutant and weather columns.
x = df_copy[['PM2.5', 'PM10', 'NO2', 'SO2', 'CO', 'O3', 'Temperature', 'Humidity', 'Wind Speed']]
y = df_copy[['AQI']]
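Once the values have been assigned, we can split our data into training and testing sets using the train_test_split function, which divides the data according to a specified proportion. We’ll hold out 33% for testing and set the random_state argument so that the split is reproducible. We then fit the model on the training set and generate predictions for the test set.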
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
For our model evaluation, we’ll use R-squared (R²) and root mean squared error (RMSE) to assess how well the model fits:
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mse)
The R-squared for our model is: 0.8963525116405926
The Root mean squared error: 12.062094559294078
The plot below visualizes the relationship between actual and predicted AQI values. Each blue point represents an actual vs. predicted AQI pair. The red diagonal line helps assess the accuracy of the predictions: the closer the points are to this line, the more accurate the predictions. If the points fall exactly on the line, it means the predictions are perfect.
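A plot of this kind could be produced along the following lines. The data here is randomly generated for illustration; in the article's workflow, y_test and y_pred would take the place of the made-up arrays. The Agg backend is selected only so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; omit this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Made-up values for illustration, standing in for y_test and y_pred.
rng = np.random.default_rng(0)
actual = rng.uniform(0, 300, size=50)
predicted = actual + rng.normal(0, 12, size=50)

fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(actual, predicted, color="blue", alpha=0.6, label="Actual vs. predicted")

# Reference line y = x: a perfect prediction lands exactly on it.
lims = [min(actual.min(), predicted.min()), max(actual.max(), predicted.max())]
ax.plot(lims, lims, color="red", label="Perfect prediction")

ax.set_xlabel("Actual AQI")
ax.set_ylabel("Predicted AQI")
ax.set_title("Actual vs. predicted AQI")
ax.legend()
```

Calling fig.savefig with a file name, or plt.show() in an interactive session, renders the figure.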
Summary
R² represents the proportion of variance in the target variable (AQI) that is explained by the features used in the model. An R² value of 0.896 means that approximately 89.6% of the variability in AQI is explained by the features in our model. A value this high indicates that our model has strong explanatory power and fits the data well.
Root mean squared error (RMSE) measures the typical magnitude of the prediction errors, with larger errors weighted more heavily because they are squared before averaging. An RMSE of 12.06 means that the predictions of AQI are typically off by roughly 12 units from the actual AQI values. For context, AQI values range from 0 to 500.
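To make the two definitions concrete, here is a small hand calculation in plain Python. The numbers are illustrative only and are not drawn from the article's dataset:

```python
import math

# Illustrative numbers only, not taken from the article's data set.
actual = [50.0, 100.0, 150.0, 200.0]
predicted = [60.0, 90.0, 160.0, 190.0]

n = len(actual)
mean_actual = sum(actual) / n

# Residual sum of squares: squared gaps between actual and predicted values.
ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
# Total sum of squares: squared gaps between actual values and their mean.
ss_tot = sum((a - mean_actual) ** 2 for a in actual)

rmse = math.sqrt(ss_res / n)  # square root of the mean squared error
r2 = 1 - ss_res / ss_tot      # share of the variance that is explained

print(rmse, r2)
```

Every prediction here misses by exactly 10 units, so the RMSE is 10, and R² comes out to 0.968 because those misses are small relative to the spread of the actual values.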
High R² with Moderate RMSE: A high R² combined with an RMSE of 12.06 generally suggests that the model is performing well. The high R² indicates strong predictive power, while the RMSE gives a tangible measure of how far off our model predictions are, on average.