<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Time Series on Chen Kai Blog</title><link>https://www.chenk.top/en/time-series/</link><description>Recent content in Time Series on Chen Kai Blog</description><generator>Hugo</generator><language>en</language><lastBuildDate>Sun, 15 Dec 2024 09:00:00 +0000</lastBuildDate><atom:link href="https://www.chenk.top/en/time-series/index.xml" rel="self" type="application/rss+xml"/><item><title>Time Series Forecasting (8): Informer -- Efficient Long-Sequence Forecasting</title><link>https://www.chenk.top/en/time-series/informer-long-sequence/</link><pubDate>Sun, 15 Dec 2024 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/time-series/informer-long-sequence/</guid><description>&lt;p>The Transformer is wonderful at sequence modeling &amp;ndash; right up to the moment your sequence gets long. Vanilla self-attention costs $\mathcal{O}(L^2)$ in both compute and memory, so a one-week hourly window (168 steps) is fine, a one-month window (720 steps) is painful, and a three-month window (2160 steps) is essentially impossible on a single GPU. That is exactly the regime real-world long-horizon forecasting lives in: weather, energy, finance, IoT.&lt;/p>
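The quadratic blow-up is easy to see with a back-of-envelope calculation (a rough illustration, not from this post; real usage multiplies these numbers by batch size, heads, and layers):

```python
# Memory for one float32 attention matrix at each window length.
def attn_matrix_mib(seq_len, bytes_per_float=4):
    # seq_len x seq_len scores, float32, converted to MiB
    return seq_len * seq_len * bytes_per_float / 2**20

for steps in (168, 720, 2160):
    print(f"{steps:5d} steps: {attn_matrix_mib(steps):8.1f} MiB per head")
```

Tripling the window from 720 to 2160 steps multiplies the attention matrix by nine, which is why long-horizon windows hit the wall so quickly.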
&lt;p>&lt;strong>Informer&lt;/strong> (Zhou et al., AAAI 2021 best paper) is the architecture that finally made Transformers practical for these settings. It does three things, each of which would be a contribution on its own:&lt;/p></description></item><item><title>Time Series Forecasting (7): N-BEATS -- Interpretable Deep Architecture</title><link>https://www.chenk.top/en/time-series/n-beats/</link><pubDate>Sat, 30 Nov 2024 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/time-series/n-beats/</guid><description>&lt;p>The 2018 M4 forecasting competition served 100,000 series across six frequencies as a single benchmark. The leaderboard was dominated by hand-tuned ensembles built from decades of statistical-forecasting craft. Soon after, a &lt;strong>pure neural network&lt;/strong> with no statistical preprocessing, no feature engineering, and no recurrence beat the competition&amp;rsquo;s winning hybrid ensemble on that same benchmark. That network was &lt;strong>N-BEATS&lt;/strong> by Oreshkin et al. &amp;ndash; a stack of fully-connected blocks with two residual paths. Its interpretable variant additionally split the forecast into a polynomial trend and a Fourier seasonality, so the very thing classical statisticians wanted (a readable decomposition) came for free.&lt;/p></description></item><item><title>Time Series Forecasting (6): Temporal Convolutional Networks (TCN)</title><link>https://www.chenk.top/en/time-series/temporal-convolutional-networks/</link><pubDate>Fri, 15 Nov 2024 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/time-series/temporal-convolutional-networks/</guid><description>&lt;p>For most of the 2010s, anyone who said &amp;ldquo;deep learning for time series&amp;rdquo; meant LSTM. The story changed in 2018 when Bai, Kolter, and Koltun published &lt;em>An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling&lt;/em>. 
Their result was annoyingly simple: take a stack of 1-D convolutions, make them causal (no peeking at the future), space the filter taps out exponentially (dilation), wrap the whole thing in residual connections, and train. On task after task, the resulting &lt;strong>Temporal Convolutional Network&lt;/strong> (TCN) matched or beat LSTM/GRU &amp;ndash; while training several times faster because every time step in the forward pass runs in parallel.&lt;/p></description></item><item><title>Time Series Forecasting (5): Transformer Architecture for Time Series</title><link>https://www.chenk.top/en/time-series/transformer/</link><pubDate>Thu, 31 Oct 2024 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/time-series/transformer/</guid><description>&lt;h2 id="what-you-will-learn">What You Will Learn&lt;/h2>
&lt;ul>
&lt;li>The full encoder-decoder Transformer, redrawn for time series&lt;/li>
&lt;li>Why position must be injected, and how sinusoidal / learned / time-aware encodings differ&lt;/li>
&lt;li>What multi-head attention actually learns over a temporal sequence&lt;/li>
&lt;li>Where vanilla attention breaks down ($O(n^2)$ cost) and the four families of fixes: sparse, linear, patched, decoder-only&lt;/li>
&lt;li>A clean PyTorch reference implementation, plus when to reach for Autoformer / FEDformer / Informer / PatchTST&lt;/li>
&lt;/ul>
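The sinusoidal encodings mentioned above can be sketched in a few lines (the standard Vaswani et al. formulation in plain Python; the post's reference implementation may differ in detail):

```python
import math

# Sinusoidal positional encoding: even dimensions get sin, odd get cos,
# with wavelengths forming a geometric progression from 2*pi to 10000*2*pi.
def sinusoidal_encoding(num_positions, d_model):
    pe = [[0.0] * d_model for _ in range(num_positions)]
    for pos in range(num_positions):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            pe[pos][i + 1] = math.cos(angle)
    return pe

pe = sinusoidal_encoding(168, 64)  # one week of hourly steps, d_model = 64
```

Because each position is a fixed function of its index, the model can attend to relative offsets without any learned parameters, which is what makes this the default choice.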
&lt;h2 id="prerequisites">Prerequisites&lt;/h2>
&lt;ul>
&lt;li>Self-attention and multi-head attention (Part 4)&lt;/li>
&lt;li>Encoder-decoder architectures and teacher forcing&lt;/li>
&lt;li>PyTorch fundamentals (&lt;code>nn.Module&lt;/code>, training loops)&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="1-why-transformers-for-time-series">1. Why Transformers for Time Series&lt;/h2>
&lt;p>LSTM and GRU process a sequence step by step. Three things follow from
that:&lt;/p></description></item><item><title>Time Series Forecasting (4): Attention Mechanisms -- Direct Long-Range Dependencies</title><link>https://www.chenk.top/en/time-series/attention-mechanism/</link><pubDate>Wed, 16 Oct 2024 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/time-series/attention-mechanism/</guid><description>&lt;h2 id="what-you-will-learn">What you will learn&lt;/h2>
&lt;ul>
&lt;li>Why recurrent models hit a wall on long-range dependencies, and how attention removes it.&lt;/li>
&lt;li>The Query / Key / Value mechanism, scaled dot-product attention, and the role of $1/\sqrt{d_k}$.&lt;/li>
&lt;li>Two classic scoring functions &amp;ndash; &lt;strong>Bahdanau&lt;/strong> (additive) and &lt;strong>Luong&lt;/strong> (multiplicative).&lt;/li>
&lt;li>How to wire &lt;strong>attention into an LSTM encoder/decoder&lt;/strong> for time series.&lt;/li>
&lt;li>&lt;strong>Multi-head attention&lt;/strong> specialised for temporal data &amp;ndash; different heads for recency, periodicity, and anomalies.&lt;/li>
&lt;li>The $O(n^2)$ memory wall and how sparse / linear attention bypass it.&lt;/li>
&lt;li>A worked &lt;strong>stock-prediction case&lt;/strong> with attention-weight overlays.&lt;/li>
&lt;/ul>
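As a minimal illustration of the Q/K/V mechanics and the $1/\sqrt{d_k}$ scaling listed above (a NumPy sketch for a single head with no batching, not this post's reference implementation):

```python
import numpy as np

# Scaled dot-product attention: similarity scores, scaled, softmaxed
# into weights, then used to mix the values.
def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n_q, n_k) similarities
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((6, 4)) for _ in range(3))
out, w = scaled_dot_product_attention(Q, K, V)   # out: (6, 4), w rows sum to 1
```

The weight matrix `w` is exactly what the attention-weight overlays in the stock-prediction case visualise: one row per query step, showing which past steps it reads from.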
&lt;p>&lt;strong>Prerequisites&lt;/strong>: RNN/LSTM/GRU intuition (Parts 2-3), basic linear algebra, PyTorch.&lt;/p></description></item><item><title>Time Series Forecasting (3): GRU -- Lightweight Gates and Efficiency Trade-offs</title><link>https://www.chenk.top/en/time-series/gru/</link><pubDate>Tue, 01 Oct 2024 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/time-series/gru/</guid><description>&lt;h2 id="what-you-will-learn">What You Will Learn&lt;/h2>
&lt;ul>
&lt;li>How GRU&amp;rsquo;s &lt;strong>update gate&lt;/strong> $z_t$ and &lt;strong>reset gate&lt;/strong> $r_t$ achieve LSTM-quality memory with one fewer gate and one fewer state.&lt;/li>
&lt;li>Why GRU has exactly &lt;strong>25% fewer parameters&lt;/strong> than LSTM, and what that buys you in practice.&lt;/li>
&lt;li>How to read GRU &lt;strong>gate activations&lt;/strong> to debug what the model is paying attention to.&lt;/li>
&lt;li>A practical &lt;strong>decision matrix&lt;/strong> for picking GRU vs LSTM, backed by parameter, speed, and forecast-quality benchmarks.&lt;/li>
&lt;li>A clean PyTorch reference implementation with the regularisation and stability tricks that actually matter.&lt;/li>
&lt;/ul>
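The 25% figure follows directly from the gate counts: both cells pay the same price per gate-style block, and GRU has three of them where LSTM has four. A quick sanity check (counting one bias vector per gate; frameworks that keep separate input and hidden biases change the totals but not the ratio):

```python
# Parameters per recurrent layer: each gate block holds a (d_h x (d_in + d_h))
# weight matrix plus a d_h bias. LSTM has 4 such blocks, GRU has 3.
def lstm_params(d_in, d_h):
    return 4 * (d_h * (d_in + d_h) + d_h)

def gru_params(d_in, d_h):
    return 3 * (d_h * (d_in + d_h) + d_h)

for d_in, d_h in [(1, 64), (8, 128)]:
    saving = 1 - gru_params(d_in, d_h) / lstm_params(d_in, d_h)
    print(d_in, d_h, f"{saving:.0%}")  # 25% regardless of sizes
```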
&lt;h2 id="prerequisites">Prerequisites&lt;/h2>
&lt;ul>
&lt;li>Comfort with the LSTM gates from &lt;a href="https://www.chenk.top/en/time-series-lstm/">Part 2&lt;/a>.&lt;/li>
&lt;li>Basic PyTorch (&lt;code>nn.Module&lt;/code>, autograd, optimizers).&lt;/li>
&lt;li>Recall that gradient flow through tanh nonlinearities is what kills vanilla RNNs.&lt;/li>
&lt;/ul>
&lt;hr>
&lt;p>&lt;figure>
 &lt;img src="https://blog-pic-ck.oss-cn-beijing.aliyuncs.com/posts/en/time-series/gru/fig1_gru_cell_architecture.png" alt="GRU cell with reset and update gates and the (1 - z) gradient highway from h_{t-1} to h_t." loading="lazy" decoding="async">
 
&lt;/figure>

&lt;em>Figure 1. The GRU cell. Two gates (&lt;code>r&lt;/code>, &lt;code>z&lt;/code>) and one state (&lt;code>h&lt;/code>) replace LSTM&amp;rsquo;s three gates and separate cell state. The orange &lt;code>(1 - z) ⊙ h_{t-1}&lt;/code> skip path is the linear gradient highway that makes long-range learning tractable.&lt;/em>&lt;/p></description></item><item><title>Time Series Forecasting (2): LSTM -- Gate Mechanisms and Long-Term Dependencies</title><link>https://www.chenk.top/en/time-series/lstm/</link><pubDate>Mon, 16 Sep 2024 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/time-series/lstm/</guid><description>&lt;h2 id="what-you-will-learn">What You Will Learn&lt;/h2>
&lt;ul>
&lt;li>Why vanilla RNNs fail on long sequences and how LSTM fixes the gradient problem&lt;/li>
&lt;li>The intuition behind each gate (forget, input, output) and the cell-state &amp;ldquo;highway&amp;rdquo;&lt;/li>
&lt;li>How to structure inputs/outputs for one-step and multi-step time series forecasting&lt;/li>
&lt;li>Practical recipes: regularization, sequence length, bidirectional vs stacked LSTM, when to choose LSTM vs GRU&lt;/li>
&lt;/ul>
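The input/output structuring mentioned above boils down to a sliding window: each training sample is a fixed-length history and the value that follows it. A generic pure-Python sketch (names here are illustrative, not from the post):

```python
# Turn a univariate series into (window, target) pairs for one-step
# forecasting; each X[i] is the history, y[i] the next observation.
def make_windows(series, window):
    X, y = [], []
    for i in range(len(series) - window):
        X.append(series[i : i + window])
        y.append(series[i + window])
    return X, y

X, y = make_windows(list(range(10)), window=3)
print(len(X), X[0], y[0])  # 7 [0, 1, 2] 3
```

For multi-step forecasting the target becomes a slice of the next $h$ values instead of a single point; the windowing logic is otherwise identical.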
&lt;h2 id="prerequisites">Prerequisites&lt;/h2>
&lt;ul>
&lt;li>Basic understanding of neural networks (forward pass, backpropagation)&lt;/li>
&lt;li>Familiarity with PyTorch (&lt;code>nn.Module&lt;/code>, tensors, optimizers)&lt;/li>
&lt;li>Part 1 of this series (helpful but not required)&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="1-the-problem-lstm-solves">1. The Problem LSTM Solves&lt;/h2>
&lt;p>A vanilla RNN updates its hidden state as&lt;/p>
$$h_t = \tanh(W_h h_{t-1} + W_x x_t + b),$$
&lt;p>so backpropagating from step $T$ to step $k$ multiplies the Jacobians&lt;/p>
$$\frac{\partial h_T}{\partial h_k} = \prod_{t=k+1}^{T} \mathrm{diag}\!\left(1 - h_t^2\right) W_h.$$
&lt;p>Two regimes appear:&lt;/p></description></item><item><title>Time Series Forecasting (1): Traditional Statistical Models</title><link>https://www.chenk.top/en/time-series/01-traditional-models/</link><pubDate>Sun, 01 Sep 2024 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/time-series/01-traditional-models/</guid><description>&lt;blockquote>
&lt;p>&lt;a href="https://www.chenk.top/en/time-series-lstm/">Next: LSTM Deep Dive &amp;ndash;&amp;gt;&lt;/a>
&lt;/p>
&lt;/blockquote>
&lt;h2 id="what-you-will-learn">What You Will Learn&lt;/h2>
&lt;ul>
&lt;li>Why &lt;strong>stationarity&lt;/strong> is the entry ticket for the whole ARIMA family, and how differencing buys it.&lt;/li>
&lt;li>How to read &lt;strong>ACF and PACF&lt;/strong> plots like a Box-Jenkins practitioner: cut-off vs. tail-off as the rule for identifying $p$ and $q$.&lt;/li>
&lt;li>The full &lt;strong>ARIMA / SARIMA&lt;/strong> machinery, including how seasonality is folded in via lag-$s$ operators.&lt;/li>
&lt;li>Where &lt;strong>VAR, GARCH, exponential smoothing, Prophet and the Kalman filter&lt;/strong> sit on the same map &amp;ndash; mean dynamics vs. variance dynamics vs. state-space recursion.&lt;/li>
&lt;li>A decision rule for when a traditional model is the right answer and when to graduate to the deep models in the rest of this series.&lt;/li>
&lt;/ul>
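Differencing, the trick that "buys" stationarity above, is simple enough to show inline (a pure-Python sketch; in practice you would reach for pandas `.diff()` or statsmodels):

```python
# First differencing: y'_t = y_t - y_{t-1}. One pass removes a linear
# trend, leaving a constant series that is trivially stationary.
def difference(series, lag=1):
    return [series[i] - series[i - lag] for i in range(lag, len(series))]

trend = [2 * t + 5 for t in range(8)]  # deterministic linear trend
print(difference(trend))  # [2, 2, 2, 2, 2, 2, 2]
```

This is exactly the "d" in ARIMA(p, d, q): difference d times until the ACF stops tailing off slowly, then fit the AR and MA parts to what remains.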
&lt;h2 id="prerequisites">Prerequisites&lt;/h2>
&lt;ul>
&lt;li>Basic probability and statistics (mean, variance, covariance, correlation).&lt;/li>
&lt;li>Familiarity with NumPy and &lt;code>pandas&lt;/code> time indexes.&lt;/li>
&lt;li>A little linear algebra for the VAR / Kalman sections (matrix multiplication, eigenvalues).&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="1-why-traditional-models-still-matter">1. Why traditional models still matter&lt;/h2>
&lt;p>Before the deep-learning era, the time-series toolbox was already remarkably complete. ARIMA captures linear autocorrelation, SARIMA adds calendar effects, VAR generalises to vectors, GARCH models the variance, and the Kalman filter unifies the lot inside a state-space recursion. They share three properties that deep models do not give for free:&lt;/p></description></item></channel></rss>