Python Implementation
Before we start, create an account with Federal Reserve Economic Data (FRED), and get an API key using this link. Please note that this product uses the FRED® API but is not endorsed or certified by the Federal Reserve Bank of St. Louis.
We start by installing and loading the required modules.
import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime, timedelta
from pandas.tseries.offsets import MonthEnd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_percentage_error
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.stattools import grangercausalitytests
from pmdarima import auto_arima
Next, we’ll create a custom function to read the data using FRED API.
FRED_API_KEY = '__YOUR_API_KEY__'

# Function to read data from the FRED API
def get_fred_data(data_id, data_name):
    response = requests.get(f'https://api.stlouisfed.org/fred/series/observations?series_id={data_id}&api_key={FRED_API_KEY}&file_type=json')
    df = pd.DataFrame(response.json()['observations'])[['date', 'value']].rename(columns={'value': data_name})
    df[data_name] = pd.to_numeric(df[data_name], errors='coerce')
    df['date'] = pd.to_datetime(df['date']) + MonthEnd(1)
    df.set_index('date', inplace=True)
    df.index.freq = 'M'
    return df
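In practice, it is safer not to hard-code the API key. Below is a minimal sketch of an alternative, assuming you have already exported the key as a FRED_API_KEY environment variable (the variable name is our own choice, not something the FRED API requires):
import os

# Read the key from the environment instead of hard-coding it
# (assumes a FRED_API_KEY environment variable has been set beforehand)
FRED_API_KEY = os.environ['FRED_API_KEY']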
Now, let’s read our data and store it in a pandas dataframe.
dependent_timeseries_id = 'MRTSSM4453USN'
dependent_timeseries_name = 'liquor_sales'

potential_leading_indicators = {
    'USACPIALLMINMEI': 'consumer_price_index',
    'PCU44534453': 'liquor_ppi',
    'DSPIC96': 'real_income',
    'LFWA25TTUSM647S': 'working_age_population',
    'UNRATENSA': 'unemployment_rate',
    'LNU01300000': 'labor_force_participation',
    'A063RC1': 'social_benefits',
    'CCLACBM027NBOG': 'consumer_loans',
}
# Read the dependent time series
timeseries = get_fred_data(dependent_timeseries_id, dependent_timeseries_name)

# Join the timeseries with the potential leading indicators
for data_id, data_name in potential_leading_indicators.items():
    df = get_fred_data(data_id, data_name)
    timeseries = timeseries.join(df)

# We will start our analysis from Jan-2010
timeseries = timeseries['2010':]

# Add the month for which we want to predict liquor_sales
timeseries = timeseries.reindex(timeseries.index.union([timeseries.index[-1] + timeseries.index.freq]))
timeseries
A quick visual analysis of our data shows that our dependent time series (liquor sales) follows roughly the same cycle every 12 months. We will use this 12-month cycle as a parameter later in our time series forecasting.
timeseries[dependent_timeseries_name].plot(figsize=(20,8));
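If we want a more formal check of this 12-month cycle, a quick seasonal decomposition is one option. Here is a minimal sketch using statsmodels' seasonal_decompose; this is an extra diagnostic rather than part of the main pipeline (note that we drop the empty forecast row added earlier before decomposing):
from statsmodels.tsa.seasonal import seasonal_decompose

# Split the dependent series into trend, seasonal and residual parts,
# assuming an additive model with a 12-month period
decomposition = seasonal_decompose(timeseries[dependent_timeseries_name].iloc[:-1], model='additive', period=12)
decomposition.plot();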
Before we test for causality, we need to confirm the stationarity of our time series data. To achieve this, we will use the Augmented Dickey–Fuller test. If our dataset fails this stationarity test, we must employ a recursive differencing approach until it satisfies the test criteria.
# Create a copy of the timeseries to use for the tests.
# Be sure to exclude the additional row we added in the previous step.
timeseries_for_gc_tests = timeseries[:-1]
all_cols = timeseries_for_gc_tests.columns

stationary_cols = []
diff_times = 0

while True:
    # Test every column for stationarity
    for col in all_cols:
        adf, pvalue, lagsused, observations, critical_values, icbest = adfuller(timeseries_for_gc_tests[col])
        if pvalue <= 0.05:
            stationary_cols.append(col)

    # Difference the time series if at least one column fails the stationarity test
    if set(stationary_cols) != set(all_cols):
        timeseries_for_gc_tests = timeseries_for_gc_tests.diff().dropna()
        diff_times += 1
        stationary_cols = []
    else:
        print(f'No of Differencing: {diff_times}')
        break
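As an optional sanity check (not part of the original pipeline), we can print the ADF p-value of each column on the differenced data to confirm that every series now passes the test:
# ADF p-values after differencing; values at or below 0.05
# indicate stationarity at the 5% significance level
adf_pvalues = {col: adfuller(timeseries_for_gc_tests[col])[1] for col in all_cols}
print(pd.Series(adf_pvalues).round(4))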
Now that we have loaded our time series data into a pandas dataframe and confirmed it passes the stationarity test, we test for causality using the Granger causality test.
maxlag = 6  # maximum number of past periods to check for potential causality; we cap ours at 6 months

leading_indicators = []

for x in all_cols[1:]:
    gc_res = grangercausalitytests(timeseries_for_gc_tests[[dependent_timeseries_name, x]], maxlag=maxlag, verbose=0)
    leading_indicators_tmp = []
    for lag in range(1, maxlag+1):
        ftest_stat = gc_res[lag][0]['ssr_ftest'][0]
        ftest_pvalue = gc_res[lag][0]['ssr_ftest'][1]
        if ftest_pvalue <= 0.05:
            leading_indicators_tmp.append({'x': x, 'lag': lag, 'ftest_pvalue': ftest_pvalue, 'ftest_stat': ftest_stat, 'xlabel': f'{x}__{lag}_mths_ago'})
    # For each indicator, keep only the lag with the strongest F-statistic
    if leading_indicators_tmp:
        leading_indicators.append(max(leading_indicators_tmp, key=lambda x: x['ftest_stat']))

# Display the leading indicators as a dataframe
pd.DataFrame(leading_indicators).reset_index(drop=True)
From our tests, we can see that current-month liquor sales are affected by the Consumer Price Indexᵈ² and Consumer Loansᵈ¹⁰ of 2 months ago, and the Labour Force Participationᵈ⁷ of 6 months ago.
Now that we have established our leading indicators, we will shift their records so that their lagged figures sit in the same row as the current liquor_sales values which they “cause”.
# Shift the leading indicators by their corresponding lag periods
for i in leading_indicators:
    timeseries[i['xlabel']] = timeseries[i['x']].shift(periods=i['lag'], freq='M')

# Select only the dependent_timeseries_name and leading indicators for further analysis
timeseries = timeseries[[dependent_timeseries_name, *[i['xlabel'] for i in leading_indicators]]].dropna(subset=[i['xlabel'] for i in leading_indicators], axis=0)
timeseries
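To make the effect of the shift concrete, here is a toy sketch with made-up values showing how shift(periods=2, freq='M') moves a January observation onto the March row:
# The January value (1) lands on the March row after a 2-month shift,
# so "2 months ago" figures line up with the current month
s = pd.Series([1, 2, 3], index=pd.date_range('2024-01-31', periods=3, freq='M'))
print(s.shift(periods=2, freq='M'))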
Next, we scale our data so that all features are within the same range, then apply PCA to eliminate multicollinearity between our leading indicators.
# Scale the dependent timeseries
y_scaler = StandardScaler()
dependent_timeseries_scaled = y_scaler.fit_transform(timeseries[[dependent_timeseries_name]])

# Scale the leading indicators
X_scaler = StandardScaler()
leading_indicators_scaled = X_scaler.fit_transform(timeseries[[i['xlabel'] for i in leading_indicators]])

# Reduce dimensionality of the leading indicators, keeping enough components to explain 90% of the variance
pca = PCA(n_components=0.90)
leading_indicators_scaled_components = pca.fit_transform(leading_indicators_scaled)

leading_indicators_scaled_components.shape
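It is also worth checking how much of the original variance the retained components preserve:
# Variance explained by each retained principal component
print(pca.explained_variance_ratio_)
print(f'Total variance explained: {pca.explained_variance_ratio_.sum():.1%}')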
Finally, we can build our SARIMAX model with the help of auto_arima. For the purpose of this implementation, we will leave all parameters at their defaults, with the exception of the seasonality flag and the number of periods in each cycle (m).
We will train our model using the timeseries data up until '2024-05-31', test with the '2024-06-30' data, then predict the '2024-07-31' liquor sales.
# Build the SARIMAX model
periods_in_cycle = 12  # number of periods per cycle; in our case, it's 12 months
model = auto_arima(y=dependent_timeseries_scaled[:-2], X=leading_indicators_scaled_components[:-2], seasonal=True, m=periods_in_cycle)
model.summary()
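Beyond the summary table, we can also inspect the orders auto_arima settled on directly:
# The non-seasonal (p, d, q) and seasonal (P, D, Q, m) orders chosen by auto_arima
print(f'Order: {model.order}, Seasonal order: {model.seasonal_order}')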
# Forecast the next two periods
preds_scaled = model.predict(n_periods=2, X=leading_indicators_scaled_components[-2:])
pred_2024_06_30, pred_2024_07_31 = np.round(y_scaler.inverse_transform([preds_scaled]))[0]

print("TEST\n----")
print(f"Actual Liquor Sales for 2024-06-30: {timeseries[dependent_timeseries_name]['2024-06-30']}")
print(f"Predicted Liquor Sales for 2024-06-30: {pred_2024_06_30}")
print(f"MAPE: {mean_absolute_percentage_error([timeseries[dependent_timeseries_name]['2024-06-30']], [pred_2024_06_30]):.1%}")
print("\nFORECAST\n--------")
print(f"Forecasted Liquor Sales for 2024-07-31: {pred_2024_07_31}")
By following the process step by step, we forecasted the liquor sales figure for July 2024, with a MAPE of just 0.4% on the June 2024 test observation.
To further improve the accuracy of our predictions, we can explore adding more potential leading indicators and fine-tuning the models used; one possible robustness check is sketched below.
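For instance, rather than relying on a single test point, we could gauge robustness with rolling-origin cross-validation. Here is a minimal sketch using pmdarima's model_selection utilities (the initial window of 60 months is an illustrative assumption, not a tuned value):
from pmdarima.model_selection import RollingForecastCV, cross_val_score

# Refit the chosen SARIMAX specification on an expanding window and
# score 1-step-ahead forecasts (MSE on the scaled series)
cv = RollingForecastCV(h=1, step=1, initial=60)
scores = cross_val_score(model, dependent_timeseries_scaled[:-2], X=leading_indicators_scaled_components[:-2], scoring='mean_squared_error', cv=cv)
print(f'Mean CV MSE (scaled units): {scores.mean():.3f}')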
Conclusion
Leading indicators, as we’ve explored, serve as early signals of future trends, providing a crucial edge in anticipating changes before they fully materialise. By leveraging techniques such as Granger causality tests to identify leading indicator series and incorporating them into forecasting models, we can significantly enhance the precision and robustness of our predictions.