MLE provides a framework that tackles this question precisely. It introduces a likelihood function, which is a function that yields another function: it takes a vector of parameters, often denoted as theta, and produces the probability density function (PDF) defined by those parameters.
The probability density function (PDF) of a distribution is a function that takes a value, x, and returns its probability density under that distribution. Therefore, likelihood functions are typically expressed as follows:
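In standard notation, with f denoting the PDF:

L(\theta; x) = f(x; \theta)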
The value of this function indicates the likelihood of observing x from the distribution defined by the PDF with theta as its parameters.
The goal
When constructing a forecast model, we have data samples and a parameterized model, and our goal is to estimate the model’s parameters. In our examples, such as Regression and MA models, these parameters are the coefficients in the respective model formulas.
The equivalent in MLE is that we have observations and a PDF for a distribution defined over a set of parameters, theta, which are unknown and not directly observable. Our goal is to estimate theta.
The MLE approach involves finding the set of parameters, theta, that maximizes the likelihood function given the observable data, x.
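Formally, the estimate is

\hat{\theta} = \arg\max_{\theta} L(\theta; x)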
We assume our samples, x, are drawn from a distribution with a known PDF that depends on a set of parameters, theta. Since the data was actually generated under the true parameter values, it should be relatively likely under them; therefore, the theta values that maximize the likelihood function evaluated on our samples should be close to the true parameter values.
Conditional likelihood
Notice that we haven’t made any assumptions about the distribution (PDF) on which the likelihood function is based. Now, let’s assume our observation X is a vector (x_1, x_2, …, x_n). We’ll consider a probability function that represents the probability of observing x_n conditional on having already observed (x_1, x_2, …, x_{n-1}):
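f(x_n \mid x_{n-1}, \dots, x_1; \theta)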
This represents the likelihood of observing just x_n given the previous values (and theta, the set of parameters). Now, we define the conditional likelihood function as follows:
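L(\theta; x_n \mid x_1, \dots, x_{n-1}) = f(x_n \mid x_{n-1}, \dots, x_1; \theta)

(the same conditional density as above, now viewed as a function of theta).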
Later, we will see why it is useful to employ the conditional likelihood function rather than the exact likelihood function.
Log-Likelihood
In practice, it is often convenient to use the natural logarithm of the likelihood function, referred to as the log-likelihood function:
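\ell(\theta; x) = \ln L(\theta; x)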
This is more convenient because we often work with a likelihood function that is a joint probability function of independent variables, which translates to the product of each variable’s probability. Taking the logarithm converts this product into a sum.
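For n independent samples, for example:

\ln \prod_{i=1}^{n} f(x_i; \theta) = \sum_{i=1}^{n} \ln f(x_i; \theta)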
For simplicity, I’ll demonstrate how to estimate the most basic moving average model — MA(1):
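x_t = \alpha + \beta\,\varepsilon_{t-1} + \varepsilon_t, \qquad \varepsilon_t \sim N(0, \sigma^2)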
Here, x_t represents the time-series observations, alpha and beta are the model parameters to be estimated, and the epsilons are random noise terms drawn from a normal distribution with zero mean and standard deviation sigma, which will also be estimated. Therefore, our “theta” is (alpha, beta, sigma), which we aim to estimate.
Let’s define our parameters and generate some synthetic data using Python:
import pandas as pd
import numpy as np

# true parameters used to simulate the MA(1) process
STD = 3.3
MEAN = 0
ALPHA = 18
BETA = 0.7
N = 1000

# x_t = ALPHA + BETA * e_{t-1} + e_t, with e_t ~ N(MEAN, STD^2)
df = pd.DataFrame({"et": np.random.normal(loc=MEAN, scale=STD, size=N)})
df["et-1"] = df["et"].shift(1, fill_value=0)
df["xt"] = ALPHA + (BETA*df["et-1"]) + df["et"]
Note that we have set the standard deviation of the error distribution to 3.3, with alpha at 18 and beta at 0.7. Plotting the generated series gives a quick feel for the data.
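For example, with matplotlib (not imported in the snippet above):

import matplotlib.pyplot as plt

# assumes df from the data-generation snippet above
df["xt"].plot(figsize=(10, 4), title="Simulated MA(1) series")
plt.show()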
Likelihood function for MA(1)
Our objective is to construct a likelihood function that addresses the question: how likely is it to observe our time series X=(x_1, …, x_n) assuming they are generated by the MA(1) process described earlier?
The challenge in computing this probability lies in the mutual dependence among our samples (as evident from the fact that both x_t and x_{t-1} depend on e_{t-1}), making it non-trivial to determine the joint probability of observing all samples (referred to as the exact likelihood).
Therefore, as discussed previously, instead of computing the exact likelihood, we’ll work with a conditional likelihood. Let’s begin with the likelihood of observing a single sample given all previous samples:
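f(x_t \mid x_{t-1}, \dots, x_1; \alpha, \beta, \sigma)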
This is much simpler to calculate because, once all the previous samples are known, the previous noise term e_{t-1} can be recovered recursively from the data: e_t = x_t - alpha - beta*e_{t-1}, starting from e_0 = 0.
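Conditionally on the past, x_t is therefore a normal random variable with mean alpha + beta*e_{t-1} and standard deviation sigma, so its density is

f(x_t \mid x_{t-1}, \dots, x_1; \alpha, \beta, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x_t - \alpha - \beta\,\varepsilon_{t-1})^2}{2\sigma^2}\right)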
All that remains is to calculate the conditional likelihood of observing all samples:
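L_{cond}(\alpha, \beta, \sigma; X) = \prod_{t=1}^{n} f(x_t \mid x_{t-1}, \dots, x_1; \alpha, \beta, \sigma)

(with the initial noise term fixed at zero, as in the code below)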
applying a natural logarithm gives:
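\ell_{cond}(\alpha, \beta, \sigma; X) = \sum_{t=1}^{n} \left[ -\ln\left(\sigma\sqrt{2\pi}\right) - \frac{\varepsilon_t^2}{2\sigma^2} \right]

where each \varepsilon_t = x_t - \alpha - \beta\,\varepsilon_{t-1} is computed recursively from the data,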
which is the function we should maximize.
Code
We’ll utilize the GenericLikelihoodModel class from statsmodels for our MLE estimation implementation. As outlined in the tutorial on statsmodels’ website, we simply need to subclass this class and include our likelihood function calculation:
from scipy import stats
from statsmodels.base.model import GenericLikelihoodModel
import statsmodels.api as sm


class MovingAverageMLE(GenericLikelihoodModel):

    def initialize(self):
        super().initialize()
        # register the extra parameters (beta, std) in addition to the intercept (alpha)
        extra_params_names = ['beta', 'std']
        self._set_extra_params_names(extra_params_names)
        self.start_params = np.array([0.1, 0.1, 0.1])

    def calc_conditional_et(self, intercept, beta):
        # recover the noise terms recursively: e_t = x_t - alpha - beta*e_{t-1}, with e_0 = 0
        df = pd.DataFrame({"xt": self.endog})
        ets = [0.0]
        for i in range(1, len(df)):
            ets.append(df.iloc[i]["xt"] - intercept - (beta*ets[i-1]))
        return ets

    def loglike(self, params):
        # conditional log-likelihood: sum of normal log-densities of the recovered noise terms
        ets = self.calc_conditional_et(params[0], params[1])
        return stats.norm.logpdf(
            ets,
            scale=params[2],
        ).sum()
The function loglike is essential to implement. Given the iterated parameter values params and the dependent variable (in this case, the time-series samples), which is stored as the class member self.endog, it calculates the conditional log-likelihood value, as we discussed earlier.
Now let’s create the model and fit it to our simulated data:
df = sm.add_constant(df) # add intercept for estimation (alpha)
model = MovingAverageMLE(df["xt"], df["const"])
r = model.fit()
r.summary()
And that’s it! The printed summary shows parameter estimates close to the values we chose for the simulation, so MLE successfully recovered them.
Estimating even a simple MA(1) model with maximum likelihood demonstrates the power of this method, which not only allows us to make efficient use of our data but also provides a solid statistical foundation for understanding and interpreting the dynamics of time series data.
Hope you liked it!