Time Series Forecasting (1): Traditional Statistical Models

The first time I touched data that “looked like a time series” — hourly server CPU usage — my instinct was to throw it at a linear regression. Time on the x-axis, usage on the y-axis. The fit was terrible. The problem wasn’t the regression; the problem was that this kind of data has its own personality. It has trends, seasonality, and a stubborn dependence between consecutive observations. A vanilla regression treats every row as an independent sample and throws away the one piece of information that matters most: time itself.

Traditional time-series models exist to put that information back in. They don’t pretend rows are independent. They model the fact that today’s value usually looks a lot like yesterday’s. Temperature readings, for instance, drift gently — today is 20°C, tomorrow will be somewhere in 18-22°C, it won’t suddenly jump to 5°C. Stock prices look superficially similar but behave very differently: today’s price is yesterday’s price plus a small shock, and only the shock is random. Both are curves on a screen, but their memory structure is completely opposite — one mean-reverts, the other random-walks. Telling those two apart is exactly what the ARIMA, GARCH, and state-space toolkit was designed for.

This chapter walks through that toolkit. The goal isn’t to memorize formulas; it’s to build a working sense of judgment. When you see a new dataset, do you difference first or look at the ACF? When you see a slow tail, is it AR or ARMA? When is a classical model enough, and when do you graduate to deep learning? Those questions don’t have one right answer, but there is a reusable diagnostic flow, and the rest of the chapter unpacks it.

Time Series Forecasting (1): Traditional Statistical Models — Chapter overview

What You Will Learn#

Why stationarity is the entry ticket for the whole ARIMA family, and how differencing buys it.
How to read ACF and PACF plots like a Box-Jenkins practitioner: cut-off vs. tail-off as the rule for identifying $$p$$ and $$q$$ .
The full ARIMA / SARIMA machinery, including how seasonality is folded in via lag- $$s$$ operators.
Where VAR, GARCH, exponential smoothing, Prophet and the Kalman filter sit on the same map — mean dynamics vs. variance dynamics vs. state-space recursion.
A decision rule for when a traditional model is the right answer and when to graduate to the deep models in the rest of this series.

Prerequisites#

Basic probability and statistics (mean, variance, covariance, correlation).
Familiarity with NumPy and pandas time indexes.
A little linear algebra for the VAR / Kalman sections (matrix multiplication, eigenvalues).

Why traditional models still matter#

Before the deep-learning era, the time-series toolbox was already remarkably complete. ARIMA captures linear autocorrelation, SARIMA adds calendar effects, VAR generalises to vectors, GARCH models the variance, and the Kalman filter unifies the lot inside a state-space recursion. They share three properties that deep models do not give for free:

Interpretability. Every parameter has a meaning — “yesterday’s level matters with weight $\phi_1$ ”, “the shock from two months ago decays with weight $\theta_2$ ”.
Calibrated uncertainty. Confidence intervals fall out of maximum likelihood, not from ad-hoc dropout tricks.
Sample efficiency. A few hundred observations is enough; you do not need a GPU.

If your series is short, smooth, or has clean calendar structure, a traditional model will routinely beat an LSTM and is much easier to debug. Use this article as the baseline you must defeat before reaching for anything fancier.

The decomposition view#

$$y_t = T_t + S_t + R_t,$$

a slowly moving trend $$T_t$$ , a periodic seasonal component $$S_t$$ (period $$s$$ ), and a residual $$R_t$$ that should look like noise once the structure is removed. The classical additive decomposition makes this concrete:

Classical additive decomposition of a synthetic monthly series into trend, seasonality and residual.

Fig. 1 — Classical additive decomposition. Once trend and seasonality are subtracted, the residual should be approximately stationary white noise; that is exactly the regime where ARMA-style models are well-posed.

The whole ARIMA programme can be summarised in one sentence: transform the data until what is left looks stationary, then fit a linear model with autocorrelated errors.

Stationarity, formally#

\mathbb{E}[y_t] = \mu, \qquad \mathrm{Var}(y_t) = \sigma^2, \qquad \mathrm{Cov}(y_t, y_{t-k}) = \gamma_k.

Most real series are not stationary — they trend, drift, or have variance that grows with the level. The two standard remedies are:

Differencing: $\nabla y_t = y_t - y_{t-1}$ removes a linear trend; $\nabla^2 y_t$ removes a quadratic one.
Variance-stabilising transforms: $\log y_t$ or a Box-Cox transform tames multiplicative growth.

The Augmented Dickey-Fuller (ADF) test gives a hypothesis test: $$H_0$$ is “unit root present”, so a small $$p$$ -value lets you treat the (transformed) series as stationary.

AR, MA, and ARMA — three flavours of memory#

ARIMA is built from two atomic ingredients. They look similar but encode very different ideas about what the series remembers.

Sample paths from AR(1), MA(2) and ARMA(1,1) processes contrasted on the same axes.

Fig. 2 — AR remembers past values, MA remembers past shocks, ARMA does both. The qualitative differences are visible in the sample paths even before any formal statistic is computed.

Autoregressive: AR( $$p$$ )#

y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \cdots + \phi_p y_{t-p} + \varepsilon_t.

Today’s value is a linear combination of the previous $$p$$ values plus a white-noise shock $\varepsilon_t \sim \mathcal{N}(0, \sigma^2)$ . Persistence is encoded directly in the coefficients $\phi_k$ . AR(1) with $\phi_1$ near 1 produces a very smooth, slowly mean-reverting path.

Moving average: MA( $$q$$ )#

y_t = \mu + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2} + \cdots + \theta_q \varepsilon_{t-q}.

Today’s value is a linear combination of the last $$q$$ shocks. The memory window is finite: an event that happened more than $$q$$ steps ago has zero direct effect.

Combining: ARMA( $$p$$ , $$q$$ )#

\phi(B)\, y_t = \theta(B)\, \varepsilon_t,

\phi(B) = 1 - \phi_1 B - \cdots - \phi_p B^p, \qquad \theta(B) = 1 + \theta_1 B + \cdots + \theta_q B^q.

ARMA is parsimonious: a small $$(p, q)$$ can imitate the autocorrelation of an MA( $\infty$ ) or AR( $\infty$ ). For stationarity all roots of $\phi(B) = 0$ must lie outside the unit circle; for invertibility the same holds for $\theta(B)$ .

From ARMA to ARIMA#

\phi(B)\, (1-B)^d\, y_t = \theta(B)\, \varepsilon_t.

In practice $$d = 1$$ handles linear drift, $$d = 2$$ handles curvature; rarely do you need $$d > 2$$ .

ACF and PACF — the model-identification microscope#

How do you pick $$p$$ and $$q$$ ? Two diagnostic plots almost always do the job.

Autocorrelation function (ACF): $\rho_k = \mathrm{Corr}(y_t, y_{t-k})$ .
Partial autocorrelation (PACF): the correlation between $$y_t$$ and $y_{t-k}$ after removing the linear effect of $y_{t-1}, \ldots, y_{t-k+1}$ .

The Box-Jenkins identification rules:

Process	ACF	PACF
AR( $$p$$ )	tails off (decays smoothly)	cuts off after lag $$p$$
MA( $$q$$ )	cuts off after lag $$q$$	tails off
ARMA( $$p$$ , $$q$$ )	tails off	tails off

ACF and PACF of an AR(2) process and an MA(2) process.

Fig. 3 — The PACF of an AR(2) drops to zero after lag 2 (the spike pattern is the signature). Symmetrically, the ACF of an MA(2) drops to zero after lag 2. When in doubt, prefer the simpler model — the diagnostic tests in step 5 will catch under-fitting.

When neither plot has a clean cut-off, you are looking at an ARMA process; the cleanest path is then to grid-search $$(p, q)$$ and pick the pair that minimises an information criterion.

AIC and BIC#

\mathrm{AIC} = -2\,\ell(\hat{\theta}) + 2k, \qquad \mathrm{BIC} = -2\,\ell(\hat{\theta}) + k\log n,

with $\ell$ the maximised log-likelihood, $$k$$ the number of free parameters, $$n$$ the sample size. BIC penalises complexity more harshly and is the safer default when $$n$$ is small.

The Box-Jenkins workflow#

ARIMA is not a one-shot fit; it is an iterative loop that Box and Jenkins formalised in 1970. Every subsequent statistical-forecasting toolkit (including auto.arima) is just an automation of the same four boxes.

Box-Jenkins methodology: identification, estimation, diagnostic checking and forecasting, with a feedback loop when residual diagnostics fail.

Fig. 4 — The point is the feedback arrow. If the fitted residuals fail a Ljung-Box test or show structure in their ACF, the model is misspecified — go back and revise $$(p, d, q)$$ rather than trusting the forecast.

Identification. Plot the series, run ADF for stationarity, study ACF/PACF, propose a candidate $$(p, d, q)$$ .
Estimation. Fit by maximum likelihood (or conditional least squares). All major libraries do this for you.
Diagnostic checking. Are the residuals white noise?
- The Ljung-Box statistic $Q = n(n+2)\sum_{k=1}^h \hat{\rho}_k^2 / (n-k)$ should fail to reject $$H_0$$ .
- The residual ACF should lie inside the $\pm 1.96/\sqrt{n}$ band.
- QQ-plots and Jarque-Bera check normality (matters for prediction intervals).
Forecasting. Only after the residuals look clean do you produce point and interval forecasts.

ARIMA in code#

A faithful, minimal walk-through using statsmodels:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import adfuller
from statsmodels.stats.diagnostic import acorr_ljungbox

rng = np.random.default_rng(0)
n = 200
eps = rng.normal(0, 1.0, n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = y[t - 1] + 0.05 + 0.6 * eps[t]

series = pd.Series(y, index=pd.date_range("2018-01-01", periods=n, freq="D"))

print("ADF on level :", adfuller(series)[1])         # large p -> non-stationary
print("ADF on diff  :", adfuller(series.diff().dropna())[1])  # small p -> stationary

train, test = series.iloc[:-30], series.iloc[-30:]
model = ARIMA(train, order=(2, 1, 1)).fit()

lb = acorr_ljungbox(model.resid, lags=[10], return_df=True)
print(lb)  # high p-value -> residuals are white noise

fc = model.get_forecast(steps=30)
mean_fc = fc.predicted_mean
ci = fc.conf_int(alpha=0.05)

The forecast on a held-out tail looks like this:

ARIMA(2,1,1) forecast with 95% confidence band on a held-out segment.

Fig. 5 — The point forecast quickly reverts to the long-run drift, and the interval fans out at rate $\sigma\sqrt{h}$ in the horizon $$h$$ . That widening cone is the honest representation of how little a linear model truly knows about the future.

Seasonality: SARIMA#

\Phi(B^s)\, \phi(B)\, (1-B)^d (1-B^s)^D y_t \;=\; \Theta(B^s)\, \theta(B)\, \varepsilon_t.

You apply two kinds of differencing — regular ( $$1-B$$ ) for trend and seasonal ( $$1-B^s$$ ) for the period — and then attach AR/MA terms at both regular lags and lags that are multiples of $$s$$ .

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
from statsmodels.tsa.statespace.sarimax import SARIMAX

model = SARIMAX(
    train,
    order=(1, 1, 1),
    seasonal_order=(1, 1, 1, 12),  # P, D, Q, s
    enforce_stationarity=False,
    enforce_invertibility=False,
).fit(disp=False)

forecast = model.get_forecast(steps=24)

SARIMA(1,1,1)(1,1,1,12) forecast on a series with strong yearly seasonality.

Fig. 6 — The seasonal cycle is reproduced cleanly out of sample because the seasonal-difference operator $(1-B^{12})$ has stripped the yearly pattern from the residual, and the seasonal AR/MA terms put it back in the forecast at the right phase.

Identification cheat sheet for the seasonal part: look at the ACF/PACF of the seasonally differenced series at the seasonal lags $s, 2s, 3s, \ldots$ . The same cut-off / tail-off rules apply as in the non-seasonal case.

Beyond ARIMA: the rest of the family#

The ideas above generalise in four useful directions. They are not separate worlds — each one specialises ARIMA in a single dimension.

VAR — multivariate dynamics#

\mathbf{y}_t = \mathbf{c} + A_1 \mathbf{y}_{t-1} + A_2 \mathbf{y}_{t-2} + \cdots + A_p \mathbf{y}_{t-p} + \boldsymbol{\varepsilon}_t.

Each entry of the matrix $$A_k$$ has an interpretation: $(A_k)_{ij}$ is the marginal effect of variable $$j$$ at lag $$k$$ on variable $$i$$ today. This makes VAR popular in macroeconomics where Granger causality (“does $$x$$ help predict $$y$$ beyond $$y$$ ’s own past?”) is the central question.

A practical caveat: with $$K$$ series and lag $$p$$ the model has $$K + pK^2$$ free parameters. For $$K = 10, p = 4$$ that is already 410 numbers from probably a few hundred observations. Hence the regularised cousins — Bayesian VAR, factor models — exist for high-dimensional settings.

GARCH — variance dynamics#

\sigma_t^2 = \omega + \alpha\, \varepsilon_{t-1}^2 + \beta\, \sigma_{t-1}^2, \qquad \varepsilon_t = \sigma_t\, z_t,\quad z_t \sim \mathcal{N}(0, 1).

The $\alpha$ term lets a large shock yesterday push variance up today (the “ARCH” effect); the $\beta$ term lets variance be persistent. Stationarity requires $\alpha + \beta < 1$ , and in financial data $\alpha + \beta$ typically lands near 0.95-0.99 — volatility is a slow-moving thing.

GARCH is the standard tool for risk management (VaR, options pricing) and pairs naturally with an ARIMA mean model: fit ARIMA to the returns, then GARCH to the squared residuals.

Exponential smoothing and Holt-Winters#

\ell_t = \alpha\, y_t + (1-\alpha)\, \ell_{t-1}.

\ell_t = \alpha (y_t - s_{t-s}) + (1-\alpha)(\ell_{t-1} + b_{t-1}),\\ b_t = \beta(\ell_t - \ell_{t-1}) + (1-\beta) b_{t-1},\\ s_t = \gamma(y_t - \ell_t) + (1-\gamma) s_{t-s}.

This is the workhorse behind the ETS family in R / statsmodels, and many of the methods that won the M-competitions are sophisticated descendants of it.

Holt-Winters additive forecast and the underlying level / trend / seasonal components.

Fig. 7 — The decomposition gives Holt-Winters its appeal: each component is intelligible on its own, the smoothing coefficients ( $\alpha, \beta, \gamma$ ) say how quickly each one adapts, and the recursive form means it costs $$O(n)$$ to fit.

Prophet#

y(t) = g(t) + s(t) + h(t) + \varepsilon_t,

where $$g(t)$$ is a piecewise-linear or logistic trend with changepoints, $$s(t)$$ is a Fourier expansion for multiple seasonalities, and $$h(t)$$ encodes user-supplied holidays. It is fitted with Stan via MAP or full Bayesian sampling, exposes only a handful of tunables, and is robust to missing data and outliers. In practice it is not state of the art on benchmarks, but the API quality and the holiday handling are why it remains a default in product analytics.

The Kalman filter and the state-space view#

\mathbf{x}_t = F_t \mathbf{x}_{t-1} + \mathbf{w}_t, \qquad \mathbf{w}_t \sim \mathcal{N}(0, Q_t),\\ \mathbf{y}_t = H_t \mathbf{x}_t + \mathbf{v}_t, \qquad \mathbf{v}_t \sim \mathcal{N}(0, R_t).

\hat{\mathbf{x}}_{t|t-1} = F_t\, \hat{\mathbf{x}}_{t-1},\\ P_{t|t-1} = F_t\, P_{t-1}\, F_t^\top + Q_t,\\ K_t = P_{t|t-1} H_t^\top (H_t P_{t|t-1} H_t^\top + R_t)^{-1},\\ \hat{\mathbf{x}}_t = \hat{\mathbf{x}}_{t|t-1} + K_t (\mathbf{y}_t - H_t \hat{\mathbf{x}}_{t|t-1}),\\ P_t = (I - K_t H_t)\, P_{t|t-1}.

Why this matters: ARIMA, exponential smoothing, dynamic linear regressions, structural models with seasonal dummies — all of them can be cast in the form above and inherit the Kalman recursion for free. That is exactly how statsmodels’s SARIMAX works under the hood, and why it can handle missing observations and exogenous regressors with no extra effort.

Choosing a model#

Model	Best when	Avoid when
ARIMA	One series, linear autocorrelation, no calendar	Strong seasonality, multiple inputs
SARIMA	Clear single seasonal period, moderate length	Multiple overlapping seasonalities
VAR	Several stationary series with feedback	High-dimensional ( $K \gtrsim 10$ )
GARCH	Returns / volatility data with clustering	Slow, smooth series with constant variance
Holt-Winters / ETS	Robust seasonal baseline, fast retraining	Highly non-linear regimes
Prophet	Business series with holidays, missing data	Sub-daily high-frequency data
Kalman / state-space	Online updating, missing data, custom structure	When a black-box is acceptable and abundant data is available

A pragmatic recipe: start with ETS or SARIMA as a baseline, add GARCH if the variance is the quantity you care about, and only move to deep models (LSTM, TCN, Transformer, N-BEATS, Informer — the rest of this series) when the residuals show clear non-linear structure or you have many parallel series to share information across.

Limits, and what comes next#

Traditional models reach their ceiling when:

The dynamics are non-linear (regime switches, thresholds, hard saturations).
You have hundreds or thousands of parallel series that should share parameters.
The relevant context window is very long and ARMA roots cannot represent it parsimoniously.
The signal is non-Gaussian in a way that even Box-Cox cannot fix.

That is the launchpad for the next seven articles. We start with LSTM as the canonical non-linear sequential model, then build up through GRU, attention, the Transformer, TCN, N-BEATS, and finally Informer for very long sequences. Each of them generalises one of the linear models above; treat ARIMA / SARIMA / Holt-Winters as the baselines you must beat, not as the past you can skip.

What’s next#

You now have a complete classical toolkit: ARIMA for mean dynamics, GARCH for volatility, Prophet for strong seasonality, Kalman filter for state-space recursion. On problems with clear structure, enough data, and no need to capture nonlinear interactions, these models are usually more stable, more interpretable, and cheaper to tune than anything deep.

They also have well-defined failure modes. When variables interact nonlinearly (think multi-sensor systems where each affects the others), when sequences are very long (thousands of steps), or when you need a single model to forecast hundreds of thousands of related series at once — classical methods start to creak. That’s the gap the rest of the series fills.

The next chapter starts with recurrent neural networks and focuses on LSTM : how its gating mechanism preserves useful information across long sequences, how it solves the vanishing-gradient problem, and why it was the de facto deep model for sequences from 2015 to 2018. If your current project is comfortably handled by ARIMA, my advice is to keep using it — but practice the ACF/PACF diagnostic flow from this chapter until it’s automatic. You’ll reuse it under every later model, including to sanity-check the residuals of deep architectures.

References#

Box, G., Jenkins, G., Reinsel, G., & Ljung, G. (2015). Time Series Analysis: Forecasting and Control (5th ed.). Wiley.
Hamilton, J. D. (1994). Time Series Analysis. Princeton University Press.
Hyndman, R. J., & Athanasopoulos, G. (2021). Forecasting: Principles and Practice (3rd ed.). OTexts. https://otexts.com/fpp3/
Durbin, J., & Koopman, S. J. (2012). Time Series Analysis by State Space Methods (2nd ed.). Oxford University Press.
Engle, R. F. (1982). Autoregressive conditional heteroscedasticity with estimates of the variance of United Kingdom inflation. Econometrica, 50(4), 987–1007.
Bollerslev, T. (1986). Generalized autoregressive conditional heteroskedasticity. Journal of Econometrics, 31(3), 307–327.
Taylor, S. J., & Letham, B. (2018). Forecasting at scale. The American Statistician, 72(1), 37-45.

Time Series Forecasting (1): Traditional Statistical Models

What You Will Learn#

Prerequisites#

Why traditional models still matter#