Predicting Time-Series with SARIMAX | by Gianpiero Andrenacci | Data Bistrot | Sep, 2024


The data is loaded into a pandas dataframe and the relevant columns are selected and renamed for clarity. Missing values are checked and handled appropriately. The dataset is then prepared for further analysis by filtering the necessary features and creating new date-related columns using the custom functions defined earlier.

Dataset link: dpc-covid19-ita-regioni.csv

1.4 Dataset Description

The dataset used in this notebook comprises daily records of Covid-19 new cases and other related metrics across different regions in Italy. Key features include:

  • date: The date of the record.
  • new_cases: The number of new Covid-19 cases reported on that date.
  • region: The name of the region in Italy where the cases were reported.
  • holiday: A binary indicator of whether the date is a public holiday in Italy.

The data spans from the initial outbreak in February 2020 and is updated regularly to provide the most current information. This dataset allows for detailed time series analysis and forecasting, which is crucial for understanding the trends and predicting future cases.
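As a rough illustration of the loading step, here is a minimal sketch. It reads a tiny in-memory sample instead of the real file, and it assumes the raw column names are data, denominazione_regione and nuovi_positivi. The sample values, the zero-fill for missing counts and the hand-written holiday list are all assumptions made for this example, not the notebook's actual code.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for dpc-covid19-ita-regioni.csv (values are made up).
raw_csv = io.StringIO(
    "data,denominazione_regione,nuovi_positivi\n"
    "2020-12-24T17:00:00,Lombardia,2500\n"
    "2020-12-24T17:00:00,Lazio,1400\n"
    "2020-12-25T17:00:00,Lombardia,2100\n"
    "2020-12-25T17:00:00,Lazio,\n"
)
df = pd.read_csv(raw_csv, parse_dates=["data"])

# Rename the (assumed) raw columns and keep only the relevant ones.
df = df.rename(columns={
    "data": "date",
    "denominazione_regione": "region",
    "nuovi_positivi": "new_cases",
})[["date", "region", "new_cases"]]
df["date"] = df["date"].dt.normalize()  # drop the time of day

# Check for missing values, then handle them (zero-fill is just one option).
missing_per_column = df.isna().sum()
df["new_cases"] = df["new_cases"].fillna(0).astype(int)

# Fixed-date Italian public holidays as (month, day); Easter Monday is
# deliberately left out because its date moves every year.
FIXED_HOLIDAYS = {(1, 1), (1, 6), (4, 25), (5, 1), (6, 2),
                  (8, 15), (11, 1), (12, 8), (12, 25), (12, 26)}
df["holiday"] = [int((d.month, d.day) in FIXED_HOLIDAYS) for d in df["date"]]

print(df)
```

With the real file, the io.StringIO object would simply be replaced by the path or URL of the CSV.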

By understanding these foundational steps, we set the stage for deeper exploration and modeling of the data, which includes data preparation, exploratory data analysis (EDA), and building and evaluating the SARIMAX model.

Data preparation and exploratory data analysis (EDA) are essential steps in any data science project. They help in understanding the data, identifying patterns, and preparing it for modeling. In this section, we will discuss how data is prepared and explored in the notebook.

2.1 Data Preparation

Data preparation involves several steps to ensure the dataset is ready for analysis and modeling. Here are the main steps taken in this notebook:

  • Aggregation and Grouping: The dataset is aggregated to get the total number of new cases per day. This involves summing up the new cases across all regions for each date.
  • Feature Creation: Additional features are created using the custom create_features function. These features include day of the week, month, year, day of the year, and other time-related attributes. These features help in capturing the temporal patterns in the data.
  • Handling Missing Values: The dataset is checked for any missing values. If any are found, appropriate measures are taken to handle them, such as imputation or removal, to ensure the dataset is complete and ready for analysis.
  • Setting the Index: The date column is set as the index of the dataframe. This is important for time series analysis as it allows for easy manipulation and plotting of the data over time.
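The preparation steps above can be sketched as follows. The data is synthetic, and the body of create_features is a hypothetical stand-in, since the notebook's own implementation is not shown here. Filling a missing regional count with zero is likewise only one possible choice.

```python
import numpy as np
import pandas as pd

# Synthetic regional records: two regions, five days, one missing value.
dates = pd.date_range("2021-01-01", periods=5, freq="D")
regional = pd.DataFrame({
    "date": dates.repeat(2),
    "region": ["Lombardia", "Lazio"] * 5,
    "new_cases": [100, 50, 120, 60, np.nan, 70, 90, 40, 110, 55],
})


def create_features(frame):
    """Hypothetical stand-in for the notebook's create_features helper.

    Expects a DataFrame with a DatetimeIndex and adds calendar columns.
    """
    out = frame.copy()
    out["day_of_week"] = out.index.dayofweek  # Monday = 0
    out["month"] = out.index.month
    out["year"] = out.index.year
    out["day_of_year"] = out.index.dayofyear
    return out


# Handle missing values (here: treat a missing regional count as zero).
regional["new_cases"] = regional["new_cases"].fillna(0)

# Aggregate across regions; grouping by date also makes date the index.
daily = regional.groupby("date")[["new_cases"]].sum()

# Add the time-related features.
daily = create_features(daily)
print(daily)
```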

2.2 EDA with Pandas-Profiling

Exploratory Data Analysis (EDA) is performed to understand the data better and identify any patterns, trends, or anomalies. In this notebook, pandas-profiling (since renamed ydata-profiling) is used for EDA:

  • Generating a Report: The pandas-profiling library generates a detailed report of the dataframe. This report includes information about the number of rows and columns, data types, missing values, and the distribution of values in each column.
  • Statistical Measures: The report provides statistical measures such as mean, median, and standard deviation for numeric columns. It also includes plots of the distribution of each column, helping to visualize the data.
  • Insights and Patterns: The report helps in identifying any patterns, trends, or anomalies in the data. For example, it may highlight periods with unusually high or low new cases, or identify columns with a significant number of missing values.

Related article: Pandas Profiling (ydata-profiling) in Python: A Guide for Beginners

2.3 Geographic Data Analysis

Geographic data analysis is conducted to understand the spatial distribution of Covid-19 cases across different regions in Italy. This involves:

  • Merging Data: The Covid-19 dataset is merged with geographic shapefiles of Italian regions. These shapefiles are obtained from the Italian National Institute of Statistics (ISTAT) and provide the geographic boundaries of each region.
  • Mapping: The merged data is used to create maps that visualize the number of new cases in each region. These maps help in identifying geographic patterns and regions with higher or lower numbers of cases.
  • Temporal Analysis: The geographic data is analyzed over time to see how the distribution of cases changes. This can highlight regions that have become hotspots or show improvements over time.

2.4 Exploratory Data Analysis

Further exploratory data analysis is conducted to gain deeper insights into the data:

  • Time Series Plots: The number of new cases is plotted over time to visualize trends and patterns. This helps in understanding the overall trajectory of the pandemic and identifying any seasonal patterns.
  • Heatmaps: Heatmaps are used to visualize the distribution of new cases on different weekdays and months. This can highlight days or periods with higher numbers of cases.
  • Pareto Analysis: A Pareto diagram is created to apply the 80–20 rule, identifying the most significant time periods or categories that account for the majority of new cases.
  • Distribution Plot: The distribution plot shows how often different daily counts of new Covid-19 cases were reported in Italy over the last three months. It summarizes the frequency and spread of those counts, for example whether most days cluster around a typical value or a few days stand out with unusually high numbers.
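The numbers behind these plots can be computed with pandas and NumPy alone. The sketch below uses synthetic daily counts and, as an assumption for illustration, applies the Pareto analysis to monthly totals. The resulting tables are what one would then pass to a plotting library.

```python
import numpy as np
import pandas as pd

# Synthetic daily totals for three months (values are random, not real data).
rng = np.random.default_rng(0)
idx = pd.date_range("2021-01-01", "2021-03-31", freq="D")
daily = pd.DataFrame(
    {"new_cases": rng.integers(5_000, 20_000, size=len(idx))}, index=idx
)
daily["weekday"] = daily.index.dayofweek  # Monday = 0
daily["month"] = daily.index.month

# Heatmap input: mean new cases per weekday (rows) and month (columns).
heat = daily.pivot_table(
    index="weekday", columns="month", values="new_cases", aggfunc="mean"
)

# Pareto input: categories sorted by total, with their cumulative share.
monthly = daily.groupby("month")["new_cases"].sum().sort_values(ascending=False)
pareto = monthly.to_frame("total")
pareto["cum_share"] = pareto["total"].cumsum() / pareto["total"].sum()

# Distribution input: how many days fall into each range of case counts.
counts, bin_edges = np.histogram(daily["new_cases"], bins=10)

print(heat.round(0))
print(pareto)
print(counts)
```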

These steps ensure that the data is well-understood and ready for the next phase of building and evaluating the SARIMAX model. By performing thorough data preparation and exploration, the notebook sets a solid foundation for accurate and reliable time series forecasting.

In this section, we will discuss how to build the SARIMAX model for predicting Covid-19 new cases in Italy. The SARIMAX (Seasonal AutoRegressive Integrated Moving Average with eXogenous regressors) model is a powerful tool for time series forecasting, especially when there are seasonal patterns and external factors influencing the data.

3.1 Data Preparation for SARIMAX

Before building the SARIMAX model, it is essential to prepare the data appropriately. This involves the following steps:

  • Creating Endogenous and Exogenous Variables: The target variable, which is the number of new Covid-19 cases, is defined as the endogenous variable. Additionally, exogenous variables such as holidays are included to account for external factors that might influence the number of new cases. These variables help the model to better capture the underlying patterns in the data.
  • Splitting the Data: The data is split into training and testing sets. The training set is used to train the model, while the testing set is used to evaluate its performance. Typically, the most recent data points are reserved for testing to assess how well the model can predict future values.
  • Setting the Frequency: The date index is set with a daily frequency to ensure the model treats the data as a daily time series. This is important for capturing the temporal structure and seasonality in the data.
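A minimal sketch of these three steps on synthetic data is shown below. The seven-day test window and the variable names are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Synthetic daily series with a binary holiday flag (made-up values).
idx = pd.date_range("2021-01-01", periods=60, freq="D")
rng = np.random.default_rng(1)
data = pd.DataFrame(
    {"new_cases": rng.integers(8_000, 15_000, size=60), "holiday": 0},
    index=idx,
)
data.loc[pd.Timestamp("2021-01-01"), "holiday"] = 1
data.loc[pd.Timestamp("2021-01-06"), "holiday"] = 1

# Make the daily frequency explicit; gaps in the dates would show up as NaN rows.
data = data.asfreq("D")

# Endogenous (target) and exogenous (external regressor) variables.
endog = data["new_cases"]
exog = data[["holiday"]]

# Chronological split: the most recent days are held out for testing.
test_days = 7
endog_train, endog_test = endog.iloc[:-test_days], endog.iloc[-test_days:]
exog_train, exog_test = exog.iloc[:-test_days], exog.iloc[-test_days:]

print(len(endog_train), len(endog_test))
```

The split is done by position rather than by random sampling, because shuffling would leak future observations into the training set.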

3.2 SARIMA ACF/PACF

The SARIMA model is an extension of the ARIMA model that includes seasonal components. To determine the appropriate parameters for the SARIMA model, we use the Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) plots:

  • ACF Plot: The ACF plot shows the correlation of the time series with its own lagged values. It helps to identify the order of the Moving Average (MA) component. A very slow, gradual decay suggests the series is non-stationary and needs differencing. On a stationary series, a sharp cut-off after lag q suggests an MA component of order q, and spikes at multiples of the seasonal period point to seasonal terms.
  • PACF Plot: The PACF plot shows the partial correlation of the time series with its own lagged values, after controlling for the correlations at shorter lags. It helps to identify the order of the AutoRegressive (AR) component. A sharp cut-off after lag p suggests an AR component of order p.
  • Seasonal Decomposition: The time series is decomposed into trend, seasonal, and residual components. This helps to visualize the underlying patterns and confirm the presence of seasonality, which is crucial for choosing the seasonal parameters of the SARIMA model.

By analyzing the ACF and PACF plots, along with the seasonal decomposition, we can make informed decisions about the parameters of the SARIMA model.

3.3 Find Optimal Hyperparameters

Finding the optimal hyperparameters for the SARIMAX model is a critical step to ensure the model’s accuracy and reliability. This involves selecting the values for the following parameters:

  • p: The order of the autoregressive component (AR).
  • d: The order of differencing needed to make the series stationary.
  • q: The order of the moving average component (MA).
  • P: The order of the seasonal autoregressive component (SAR).
  • D: The order of seasonal differencing.
  • Q: The order of the seasonal moving average component (SMA).
  • m: The number of periods in each season (e.g., 7 for weekly seasonality in daily data).
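For reference, these parameters fit together as follows in the common textbook form of the model, written as a regression on the exogenous variables with SARIMA errors. The symbols B (backshift operator), x_t (exogenous variables), beta (their coefficients), u_t (regression error) and epsilon_t (white noise) are notation introduced here, not taken from the notebook.

```latex
\begin{aligned}
y_t &= \beta^{\top} x_t + u_t, \\
\phi_p(B)\,\Phi_P(B^{m})\,(1-B)^{d}\,(1-B^{m})^{D}\,u_t
  &= \theta_q(B)\,\Theta_Q(B^{m})\,\varepsilon_t, \\
\phi_p(B) &= 1 - \phi_1 B - \dots - \phi_p B^{p}, \qquad
\theta_q(B) = 1 + \theta_1 B + \dots + \theta_q B^{q}, \\
\Phi_P(B^{m}) &= 1 - \Phi_1 B^{m} - \dots - \Phi_P B^{Pm}, \qquad
\Theta_Q(B^{m}) = 1 + \Theta_1 B^{m} + \dots + \Theta_Q B^{Qm}.
\end{aligned}
```

Here B y_t = y_{t-1}, so (1-B)^d applies d rounds of ordinary differencing and (1-B^m)^D applies D rounds of seasonal differencing at lag m.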

To automate the selection of these hyperparameters, the auto_arima function from the pmdarima library is used. By default it runs a stepwise search over candidate orders (an exhaustive grid search is available by disabling the stepwise option) and keeps the combination that minimizes an information criterion such as the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC). These criteria are not error metrics: they trade goodness of fit against the number of parameters.

Using auto_arima, we can efficiently find a good set of hyperparameters for the SARIMAX model. Because AIC and BIC penalize extra parameters, the search favors models that capture the main patterns in the data without unnecessary complexity, which reduces (but does not eliminate) the risk of overfitting.

By following these steps, we build a robust SARIMAX model capable of accurately predicting the future number of Covid-19 new cases in Italy. The prepared data, the analysis of ACF/PACF plots, and the optimized hyperparameters all contribute to the model’s effectiveness and reliability.

Once the SARIMAX model is built and trained, the next steps are to use the model for making predictions and to evaluate its performance. This section will cover the conceptual aspects of these steps as demonstrated in the Kaggle notebook.

4.1 Predict

With the SARIMAX model trained on historical data, we proceed to make predictions for future values of new Covid-19 cases. The process involves:

  • Forecasting Future Values: The model generates forecasts for a specified number of future time periods. For instance, in this notebook, the model is used to predict new cases for the next seven days. The model takes into account the historical data, seasonal patterns, and exogenous variables (such as holidays) to make these predictions.
  • Generating Forecasts with Exogenous Variables: If the model includes exogenous variables, future values of these variables must also be provided to generate accurate forecasts. This ensures that the model considers all relevant factors affecting the time series during the prediction period.
  • Visualization of Predictions: The predicted values are typically visualized alongside the actual values to provide a clear comparison. This helps in assessing how well the model is performing and whether the predictions align closely with the observed data. Visualization can include line plots showing the predicted and actual values over time.

4.2 Evaluate the Model

Evaluating the model’s performance is a critical step to understand its accuracy and reliability. This involves several key metrics and techniques:

  • Error Metrics: Various error metrics are used to quantify the model’s performance. Common metrics include:
      • Mean Squared Error (MSE): Measures the average squared difference between predicted and actual values. Lower values indicate better model performance.
      • Root Mean Squared Error (RMSE): The square root of MSE, providing error in the same units as the data.
      • Mean Absolute Error (MAE): Measures the average absolute difference between predicted and actual values.
      • Mean Absolute Percentage Error (MAPE): Provides error as a percentage of the actual values, useful for comparing performance across different scales.
      • R-squared (R²): Indicates the proportion of the variance in the dependent variable that is predictable from the independent variables. Higher values indicate better model performance.
  • Residual Analysis: Examining the residuals (differences between actual and predicted values) helps in diagnosing any patterns or biases in the model. Ideally, residuals should be randomly distributed with no discernible patterns, indicating a well-fitted model.
  • Visualization of Actual vs. Predicted Values: Plotting the actual values against the predicted values provides a visual assessment of the model’s accuracy. This can highlight periods where the model performs well and periods where it may struggle.
  • Model Diagnostics: Additional diagnostic plots can be used to assess the model’s assumptions and performance. These plots might include:
      • Residual Plots: To check for homoscedasticity (constant variance of residuals).
      • Q-Q Plots: To check if residuals follow a normal distribution.
      • Autocorrelation Plots of Residuals: To ensure residuals are not autocorrelated.
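As a small worked example, the error metrics listed above can be computed with NumPy. The four actual and predicted values are invented so the arithmetic is easy to check by hand. Note that MAPE is undefined whenever an actual value is zero.

```python
import numpy as np

# Made-up actual and predicted values for four days.
actual = np.array([100.0, 200.0, 300.0, 400.0])
predicted = np.array([110.0, 190.0, 310.0, 380.0])

residuals = actual - predicted                      # [-10, 10, -10, 20]

mse = np.mean(residuals ** 2)                       # 175.0
rmse = np.sqrt(mse)                                 # about 13.23
mae = np.mean(np.abs(residuals))                    # 12.5
mape = np.mean(np.abs(residuals / actual)) * 100    # about 5.83 (percent)

# R² = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum(residuals ** 2)                     # 700
ss_tot = np.sum((actual - actual.mean()) ** 2)      # 50000
r2 = 1 - ss_res / ss_tot                            # 0.986

print(mse, round(rmse, 2), mae, round(mape, 2), round(r2, 3))
```

RMSE and MAE are in the same units as the data (cases per day), which usually makes them easier to interpret than MSE.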

By thoroughly evaluating the model using these metrics and techniques, we can gain confidence in its predictions and identify areas for potential improvement. This comprehensive approach ensures that the SARIMAX model is robust and reliable for forecasting Covid-19 new cases in Italy.

This completes the conceptual explanation of the notebook for predicting Covid-19 new cases using SARIMAX. For detailed code and implementation, please refer to the Kaggle notebook itself.
