How to do exploratory data analysis of a time series
One of the most popular types of data is a time series. Videos, images, pixel values, signals: almost anything with a time component can be turned into one. Formally, a time series is a sequence of historical measurements of an observable variable taken at equal time intervals.
In this article I want to suggest a small pipeline which anyone can use when analyzing a time series. It could help you to derive meaningful insights about the data itself, prepare it for modeling and draw some preliminary conclusions.
Since my favorite word is geospatial, today we will analyze a meteorological time series. In particular, we will explore 2 m air temperature, total precipitation, surface net solar radiation and surface pressure at a point in south-eastern Siberia over 2023, derived from the hourly ERA5-Land [1] climate reanalysis.
As always, to follow along with the tutorial, you can download and run the notebook here.
To accomplish the analysis we need to import several libraries:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import xarray as xr
import statsmodels.api as sm
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from scipy import stats
I also decided to try a new matplotlib style from two libraries, opinionated and ambivalent:
from ambivalent import STYLES
import opinionated
plt.style.use(STYLES['ambivalent'])
plt.style.use("dark_background")
First of all, let's load and visualize the data we have. To handle geospatial multidimensional arrays we will use the xarray library.
data = xr.open_dataset('Medium_data.nc')
data
Now we need to slice the data for the chosen location, convert it to a pandas dataframe and create line plots:
df = data.sel(latitude=52.53, longitude=101.63, method='pad').to_pandas().drop(['latitude', 'longitude'], axis=1)
fig, ax = plt.subplots(ncols = 2, nrows = 2, figsize=(16,9))
df['t2m'].plot(ax=ax[0,0])
ax[0,0].set_title('Air Temperature')
df['ssr'].plot(ax=ax[0,1])
ax[0,1].set_title('Surface Net Solar Radiation')
df['sp'].plot(ax=ax[1,0])
ax[1,0].set_title('Surface Pressure')
df['tp'].plot(ax=ax[1,1])
ax[1,1].set_title('Total Precipitation')
plt.tight_layout()
plt.show()
It's already clear from the line plots that all four time series have different features, so let's investigate them using mathematical tools.
Any time series has three important attributes to consider:
- Trend, which is a smooth long-term change in a time series;
- Seasonality, which refers to a regular periodic change in the mean of the series;
- Noise (residuals), which is a random component of a signal with a mean equal to zero.
To get each of these components separately, it's possible to perform a classical decomposition (additive or multiplicative). This is done by applying a convolutional filter, after which each value of the time series is expressed as either

y_t = S_t + T_t + n_t

or

y_t = S_t * T_t * n_t

where y_t is a value from the time series, S_t is the seasonal component, T_t the trend component and n_t the noise.
To perform the decomposition, besides selecting its type, you need to set the seasonal period (e.g. p=1 for annual, p=4 for quarterly, p=12 for monthly data, etc.).
It's important to mention that classical decomposition is a rather naive and simple method. It has significant limitations, such as its linearity, its inability to capture dynamic seasonality and its difficulty handling non-stationarity in a time series. However, for the purposes of this article this method is more than enough.
To perform the classical decomposition we will use the seasonal_decompose function from the statsmodels library with a period equal to 24, since we are dealing with hourly data:
vars = {'t2m': 'Air Temperature', 'tp': 'Total Precipitation', 'sp': 'Surface Pressure', 'ssr': 'Surface Net Solar Radiation'}
for var in df.columns:
    result = sm.tsa.seasonal_decompose(df[var], model='additive', period=24)
    results_df = pd.DataFrame({'trend': result.trend, 'seasonal': result.seasonal, 'resid': result.resid, 'observed': result.observed})
    fig, ax = plt.subplots(ncols=2, nrows=2, figsize=(16, 9))
    ax[0,0].plot(df.index, results_df.trend)
    ax[0,0].set_title('Trend')
    ax[0,0].set_ylabel('Value')
    ax[0,1].plot(df.index, results_df.seasonal)
    ax[0,1].set_title('Seasonal')
    ax[1,0].plot(df.index, results_df.resid)
    ax[1,0].set_title('Residual')
    ax[1,0].set_ylabel('Value')
    ax[1,0].set_xlabel('time')
    ax[1,1].plot(df.index, results_df.observed)
    ax[1,1].set_title('Observed')
    ax[1,1].set_xlabel('time')
    opinionated.set_title_and_suptitle(vars[var], f"Dickey-Fuller test: {round(sm.tsa.stattools.adfuller(df[var])[1], 5)}",
                                       position_title=[0.45, 1], position_sub_title=[0.95, 1])
    plt.tight_layout()
    plt.savefig(f'Seasonal_{var}.png')
    plt.show()
You can see that for all the variables the seasonal component looks like a mess. Since we are analyzing hourly data, these seasonal variations are observed within one day and are not really informative. In this case, it's worth resampling the data to daily resolution and repeating the decomposition on the daily series.
df_d = df.resample('1d').mean()
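The snippet below is a minimal sketch of how that re-run could look. The choice of period=30 (a rough monthly cycle) is my assumption rather than a value from the original pipeline; pick the period that matches the seasonality you are interested in, and keep in mind that seasonal_decompose needs at least two full cycles of data.
for var in df_d.columns:
    # period=30 is an illustrative choice (roughly a monthly cycle); adjust as needed
    result_d = sm.tsa.seasonal_decompose(df_d[var], model='additive', period=30)
    fig = result_d.plot()
    fig.suptitle(vars[var])
    plt.tight_layout()
    plt.show()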
By now some of you have probably noticed the Dickey-Fuller test label in the upper-right corner of the plot. This is a stationarity test, performed with the adfuller function of the same library. For a time series, stationarity means that the properties of the series do not change over time; by properties I mean variance, seasonality, trend and autocorrelation.
When applying the Augmented Dickey-Fuller (ADF) test to a time series, we pose the null hypothesis that the time series is non-stationary. Then we select a significance level α, which is usually 5%. In essence, α is the probability of incorrectly rejecting the null hypothesis when it is actually true. So in our case, with α=5%, there is a 5% risk of concluding that the time series is stationary when it actually is not.
The test returns a p-value. If it's lower than 0.05, we can reject the null hypothesis.
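For reference, this is what the same test looks like when called on its own (the 0.05 threshold below mirrors the α discussed above):
adf_stat, p_value, *_ = sm.tsa.stattools.adfuller(df['t2m'])
print(f'ADF statistic: {adf_stat:.3f}, p-value: {p_value:.5f}')
if p_value < 0.05:
    print('Reject the null hypothesis: the series looks stationary')
else:
    print('Cannot reject the null hypothesis: the series may be non-stationary')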
As you can see, all 4 variables are stationary according to the ADF test.
Stationarity is usually a must-have for applying time series forecasting models such as ARIMA, so we are lucky here. This is also why meteorological and climatic data appear so often in time-series learning materials: in most cases they are stationary.
Having concluded that all our time series are stationary, let's have a look at how they are distributed. To do that we will use the well-known seaborn library and its pairplot function, which lets us create informative plots combining histograms and kernel density estimates (KDEs).
ax = sns.pairplot(df, diag_kind='kde')
ax.map_upper(sns.histplot, bins=20)
ax.map_lower(sns.kdeplot, levels=5, color='.1')
plt.show()
Let's consider the example of t2m (first row, first column). Looking at the kernel density estimation (KDE) plot, it's clear that the distribution of this variable is multimodal, meaning it has two or more "bells". So later in this article we will try to transform the variable so that it more closely resembles a normal distribution.
The other plots in the first column and row carry the same information but visualize it differently. Essentially, these are scatter plots, which let us see how two variables are correlated: the darker the color of a point, or the closer a point is to the central contour, the higher the density of points in that area.
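If you want a numeric counterpart to these visual impressions (this is my addition, not part of the original pipeline), the pairwise Pearson correlations can be printed directly:
print(df.corr().round(2))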
Since we have discovered that the air temperature time series is stationary but not normally distributed, let's try to fix that using the Box-Cox transformation. To do so we will use the scipy package and its boxcox function.
df_d['t2m_box'], _ = stats.boxcox(df_d.t2m)
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(15,7))
sns.histplot(df_d.t2m_box, kde=True, ax=ax[0])
sns.histplot(df_d.t2m, kde=True, ax=ax[1])
The left panel shows the distribution of our time series after the Box-Cox transformation, and as you can see, it's still far from what could be called "normally" distributed. But compared with the original distribution on the right, it is definitely closer to a normal one.
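To back this visual comparison with numbers, one option (an addition of mine) is to compare skewness and a normality test before and after the transformation: skewness closer to zero and a larger p-value both point towards a more normal-looking distribution.
for name, series in [('original', df_d.t2m), ('box-cox', df_d.t2m_box)]:
    # stats.skew measures asymmetry; stats.normaltest is D'Agostino's K^2 test
    print(name,
          'skewness:', round(stats.skew(series), 3),
          'normality test p-value:', round(stats.normaltest(series)[1], 5))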
Another way to check whether the transformation was useful is to create a probability plot. In essence, we plot the quantiles of a theoretical distribution (in our case, normal) against the ordered values of our empirical data (i.e. the time series we consider). The closer the points lie to the white line, the better.
fig = plt.figure()
ax1 = fig.add_subplot(211)
prob = stats.probplot(df_d.t2m, dist=stats.norm, plot=ax1)
ax1.get_lines()[1].set_color('w')
ax1.get_lines()[0].set_color('#8dd3c7')
ax1.set_title('Probplot against normal distribution')
ax2 = fig.add_subplot(212)
prob = stats.probplot(df_d.t2m_box, dist=stats.norm, plot=ax2)
ax2.get_lines()[1].set_color('w')
ax2.get_lines()[0].set_color('#8dd3c7')
ax2.set_title('Probplot after Box-Cox transformation')
plt.tight_layout()
plt.show()
If you are going to use the transformed time series for ML modeling, don't forget to apply the inverse Box-Cox transformation to the predictions, otherwise you'll be dealing with values on the wrong scale!
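Here is a minimal sketch of that inverse step, assuming you keep the λ fitted by boxcox (it was discarded as _ earlier, so re-fit or store it):
from scipy.special import inv_boxcox

t2m_transformed, lmbda = stats.boxcox(df_d.t2m)  # keep the fitted lambda
t2m_restored = inv_boxcox(t2m_transformed, lmbda)  # back on the original scale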
The last step in our analysis is autocorrelation. The autocorrelation function (ACF) estimates the correlation between a time series and a lagged version of itself; in other words, how a specific value of a time series correlates with prior values at different time intervals.
It might also be helpful to plot the partial autocorrelation function (PACF), which is similar to the ACF but with the correlation contributed by shorter lags removed. In other words, it estimates the correlation between values at a given lag while controlling for the influence of the intermediate values.
for var in df.columns:
    fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 8))
    plot_acf(df_d[var], ax=ax1)
    plot_pacf(df_d[var], ax=ax2)
    opinionated.set_title_and_suptitle(vars[var], '', position_title=[0.38, 1],
                                       position_sub_title=[0.95, 1])
    plt.tight_layout()
    plt.show()
As you can see, there is a really strong partial autocorrelation in the surface pressure time series at a 1-day lag. It then weakens significantly, and beyond a 3-day lag it is almost absent. Such an analysis can help you better understand the nature of the data you are dealing with and, hence, derive more meaningful conclusions.