Time series analysis, and more specifically time series forecasting, is a well-known data science problem among professionals and business users alike.
Many forecasting methods exist; for an easier overview they can be grouped into statistical and machine learning approaches, and the demand for forecasting is so high that the available options are abundant.
Machine learning methods are considered the state-of-the-art approach in time series forecasting and are growing in popularity, because they are able to capture complex non-linear relationships within the data and generally yield higher forecasting accuracy [1]. One popular machine learning field is neural networks. Specifically for time series analysis, recurrent neural networks have been developed and applied to solve forecasting problems [2].
Data science enthusiasts might find the complexity behind such models intimidating, and as one of them I can tell you that I share that feeling. However, this article aims to show that despite the latest developments in machine learning methods, it is not always worth pursuing the most complex application when looking for a solution to a particular problem. Well-established methods, enhanced with powerful feature engineering techniques, can still provide satisfactory results.
More specifically, I apply a Multi-Layer Perceptron model and share the code and results, so you can get hands-on experience in engineering time series features and forecasting effectively.
What I aim to provide for fellow self-taught professionals can be summarized in the following points:
- forecasting based on real-world problem / data
- how to engineer time series features for capturing temporal patterns
- build an MLP model capable of utilizing mixed variables: floats and integers (treated as categoricals via embedding)
- use MLP for point forecasting
- use MLP for multi-step forecasting
- assess feature importance using permutation feature importance method
- retrain the model on subsets of grouped features (multiple groups, trained individually) to refine the importance of grouped features
- evaluate the model by comparing it to an UnobservedComponents model
Please note that this article assumes prior knowledge of some key technical terms and does not intend to explain them in detail. Those key terms are listed below, with references that can be consulted for clarity:
- Time Series [3]
- Prediction [4] — in this context it will be used to distinguish model outputs in the training period
- Forecast [4] — in this context it will be used to distinguish model outputs in the test period
- Feature Engineering [5]
- Autocorrelation [6]
- Partial Autocorrelation [6]
- MLP (Multi-Layer Perceptron) [7]
- Input Layer [7]
- Hidden Layer [7]
- Output Layer [7]
- Embedding [8]
- State Space Models [9]
- Unobserved Components Model [9]
- RMSE (Root Mean Squared Error) [10]
- Feature Importance [11]
- Permutation Feature Importance [11]
The essential packages used during the analysis are numpy and pandas for data manipulation, plotly for interactive charts, statsmodels for statistics and state space modeling and finally, tensorflow for the MLP architecture.
Note: due to technical limitations, I provide the code snippets for interactive plotting, but the figures presented here will be static.
import opendatasets as od
import numpy as np
import pandas as pd
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import tensorflow as tf
from sklearn.preprocessing import StandardScaler
from sklearn.inspection import permutation_importance
import statsmodels.api as sm
from statsmodels.tsa.stattools import acf, pacf
import datetime
import warnings
warnings.filterwarnings('ignore')
The data is loaded automatically using opendatasets.
dataset_url = "https://www.kaggle.com/datasets/robikscube/hourly-energy-consumption/"
od.download(dataset_url)
df = pd.read_csv(".\hourly-energy-consumption" + "\AEP_hourly.csv", index_col=0)
df.sort_index(inplace = True)
Keep in mind that data cleaning was an essential first step in order to progress with the analysis. If you are interested in the details and also in state space modeling, please refer to my previous article here. ☚📰 In a nutshell, the following steps were conducted (a minimal sketch is shown after the list):
- Identifying gaps where specific timestamps are missing (only single missing steps were identified)
- Performing imputation (using the mean of the previous and next records)
- Identifying and dropping duplicates
- Setting the timestamp column as the dataframe index
- Setting the dataframe index frequency to hourly, because it is a requirement for further processing
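The snippet below is only a minimal sketch of what these cleaning steps could look like in pandas; it assumes the Datetime index and the AEP_MW column of the downloaded file, and the exact implementation in my previous article may differ.
# parse the timestamp index (sketch only; assumes Datetime index and AEP_MW column)
df.index = pd.to_datetime(df.index)
# identify and drop duplicated timestamps, keeping the first occurrence
df = df[~df.index.duplicated(keep="first")]
# reindex to a complete hourly range, which exposes missing timestamps as NaN rows
full_range = pd.date_range(df.index.min(), df.index.max(), freq="H")
df = df.reindex(full_range)
# fill single-step gaps with the mean of the previous and next records
# (linear interpolation limited to one consecutive missing value)
df["AEP_MW"] = df["AEP_MW"].interpolate(method="linear", limit=1)
# set the index frequency explicitly, as required for further processing
df = df.asfreq("H")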
After preparing the data, let's explore it by drawing 5 random timestamp samples and comparing the time series at different scales.
fig = make_subplots(rows=5, cols=4, shared_yaxes=True, horizontal_spacing=0.01, vertical_spacing=0.04)

# drawing a random sample of 5 indices without repetition
sample = sorted([x for x in np.random.choice(range(0, len(df), 1), 5, replace=False)])
# zoom x scales for plotting
periods = [9000, 3000, 720, 240]
colors = ["#E56399", "#F0B67F", "#DE6E4B", "#7FD1B9", "#7A6563"]
# s for sample datetime start
for si, s in enumerate(sample):
    # p for period length
    for pi, p in enumerate(periods):
        cdf = df.iloc[s:(s+p+1), :].copy()
        fig.add_trace(go.Scatter(x=cdf.index,
                                 y=cdf.AEP_MW.values,
                                 marker=dict(color=colors[si])),
                      row=si+1, col=pi+1)
fig.update_layout(
    font=dict(family="Arial"),
    margin=dict(b=8, l=8, r=8, t=8),
    showlegend=False,
    height=1000,
    paper_bgcolor="#FFFFFF",
    plot_bgcolor="#FFFFFF")
fig.update_xaxes(griddash="dot", gridcolor="#808080")
fig.update_yaxes(griddash="dot", gridcolor="#808080")