Time Series on Chen Kai Blog

Time Series Forecasting (8): Informer — Efficient Long-Sequence Forecasting

Sun, 15 Dec 2024 09:00:00 +0000

The Transformer is wonderful at sequence modeling — right up to the moment your sequence gets long. Vanilla self-attention costs $\mathcal{O}(L^2)$ in both compute and memory, so a one-week hourly window (168 steps) is fine, a one-month window (720 steps) is painful, and a three-month window (2160 steps) is essentially impossible on a single GPU. That is exactly the regime real-world long-horizon forecasting lives in: weather, energy, finance, IoT.
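To put rough numbers on the quadratic cost, here is a back-of-the-envelope sketch that counts only the bytes needed to store the $L \times L$ attention score matrices of one layer. The head count (8), batch size (32) and float32 storage are illustrative assumptions, not figures from the post, and the function name `attention_score_bytes` is hypothetical.

```python
def attention_score_bytes(seq_len, n_heads=8, batch_size=32, bytes_per_value=4):
    """Bytes needed to hold the L x L attention score matrices of ONE layer.

    Assumed defaults (not from the post): 8 heads, batch of 32, float32 values.
    Activations, gradients and weights are ignored, so this is a lower bound.
    """
    return batch_size * n_heads * seq_len * seq_len * bytes_per_value


for label, seq_len in [("one week", 168), ("one month", 720), ("three months", 2160)]:
    mib = attention_score_bytes(seq_len) / 2**20
    print(f"{label:>12}: L={seq_len:<5} scores per layer = {mib:,.0f} MiB")
```

Doubling the window quadruples this number, and training has to keep these matrices for every layer in order to backpropagate, which is why the longer windows become painful so quickly.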

Time Series Forecasting (7): N-BEATS — Interpretable Deep Architecture

Sat, 30 Nov 2024 09:00:00 +0000

The 2018 M4 forecasting competition pooled 100,000 series across six frequencies into a single benchmark. The leaderboard was dominated by hand-tuned ensembles and hybrids built on decades of statistical-forecasting craft. Then a pure neural network with no statistical preprocessing, no feature engineering, and no recurrence beat the winning entry's score on the same data. That network was N-BEATS by Oreshkin et al.: a stack of fully-connected blocks linked by two residual paths, one for the backcast and one for the forecast. Its interpretable variant additionally split the forecast into a polynomial trend and a Fourier seasonality, so the very thing classical statisticians wanted (a readable decomposition) came for free.
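The trend and seasonality split can be sketched as two fixed basis expansions: the network only has to predict a handful of coefficients, and the basis turns them into a forecast. The snippet below is a minimal illustration of that idea, not the N-BEATS implementation. The function names, the normalised time grid `t = h / horizon`, and the choice to start harmonics at 1 are assumptions made for the example.

```python
import math


def trend_forecast(theta, horizon):
    """Polynomial trend: sum_p theta[p] * t**p on a normalised time grid."""
    t = [h / horizon for h in range(horizon)]
    return [sum(c * ti**p for p, c in enumerate(theta)) for ti in t]


def seasonality_forecast(theta_cos, theta_sin, horizon):
    """Fourier seasonality: sum_i a_i*cos(2*pi*i*t) + b_i*sin(2*pi*i*t), i = 1, 2, ..."""
    t = [h / horizon for h in range(horizon)]
    return [
        sum(
            a * math.cos(2 * math.pi * i * ti) + b * math.sin(2 * math.pi * i * ti)
            for i, (a, b) in enumerate(zip(theta_cos, theta_sin), start=1)
        )
        for ti in t
    ]


# Hypothetical coefficients standing in for what a trained block would output.
trend = trend_forecast([1.0, 2.0], horizon=4)                # 1 + 2t
season = seasonality_forecast([1.0], [0.0], horizon=4)       # one cosine cycle
forecast = [a + b for a, b in zip(trend, season)]
print(trend, season, forecast)
```

Because each component is a named basis, the trend and seasonal parts of a forecast can be plotted separately, which is where the interpretability comes from.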

Time Series Forecasting (6): Temporal Convolutional Networks (TCN)

Fri, 15 Nov 2024 09:00:00 +0000

For most of the 2010s, saying “deep learning for time series” meant using LSTM. The story changed in 2018 when Bai, Kolter, and Koltun published “An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling.” Their result was surprisingly simple: use a stack of 1-D convolutions, make them causal (no peeking at the future), space the filter taps exponentially (dilation), wrap the whole thing in residual connections, and train. Task after task, the resulting Temporal Convolutional Network (TCN) matched or beat LSTM/GRU while training several times faster, because every time step in the forward pass runs in parallel.
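The two key ingredients, causality and dilation, fit in a few lines. The sketch below is a plain-Python illustration with hypothetical function names, not the authors' code. It assumes one convolution per layer and zero padding on the left; the published architecture uses two convolutions per residual block, which enlarges the receptive field further.

```python
def causal_dilated_conv(x, weights, dilation=1):
    """y[t] = sum_k weights[k] * x[t - k*dilation]; indices before 0 count as zero."""
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:          # never reads x[t+1], x[t+2], ... (causal)
                acc += w * x[idx]
        out.append(acc)
    return out


def receptive_field(kernel_size, dilations):
    """How many past steps one output can see, assuming one conv per layer."""
    return 1 + (kernel_size - 1) * sum(dilations)


print(causal_dilated_conv([1, 2, 3, 4, 5], [1, 1], dilation=2))
print(receptive_field(kernel_size=3, dilations=[1, 2, 4, 8]))
```

With dilations that double at each layer, the receptive field grows exponentially with depth while the parameter count grows only linearly.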

Time Series Forecasting (5): Transformer Architecture for Time Series

Thu, 31 Oct 2024 09:00:00 +0000

The 2017 Attention Is All You Need paper took the attention mechanism from the previous chapter to its logical extreme: drop the RNN entirely. Transformers stack pure attention into a full sequence model — no recurrence, no hidden state propagating over time. Originally designed for machine translation, the architecture was quickly adapted to every other sequence task, time series included.

Dropping a vanilla NLP Transformer onto a time-series problem runs into two immediate complications. The first is position. Self-attention on its own has no notion of order: shuffle the input steps and you get the same output vectors back, shuffled the same way, so the model cannot tell which step came first. For a time series, order is everything: a temperature curve that goes up-then-down and one that goes down-then-up are entirely different signals. NLP solves this with sinusoidal position encodings; do those still make sense for time series, or should we use learned encodings, or just concatenate calendar features (hour-of-day, day-of-week) directly into the input?
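For reference, the sinusoidal encoding from the original Transformer paper assigns each position a vector of sines and cosines at geometrically spaced frequencies. A minimal pure-Python sketch follows; the function name `sinusoidal_encoding` is hypothetical, and a real model would build the table as a tensor and add it to the input embeddings.

```python
import math


def sinusoidal_encoding(n_positions, d_model):
    """PE[pos][2i] = sin(pos / 10000**(2i/d_model)), PE[pos][2i+1] = cos(same angle)."""
    table = []
    for pos in range(n_positions):
        row = []
        for i in range(d_model):
            angle = pos / (10000 ** (2 * (i // 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


pe = sinusoidal_encoding(n_positions=4, d_model=6)
for pos, row in enumerate(pe):
    print(pos, [round(v, 3) for v in row])
```

Each position gets a distinct vector, which is what gives attention a way to tell step 3 from step 30. Whether this beats learned encodings or explicit calendar features on a given dataset is an empirical question.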

Time Series Forecasting (4): Attention Mechanisms — Direct Long-Range Dependencies

Wed, 16 Oct 2024 09:00:00 +0000

RNNs and LSTMs handled “too many time steps” but left a subtler limitation in place: information has to travel step by step. For step 100 to see what happened at step 1, the signal has to ride the hidden state through 99 intermediate stops — and each stop attenuates the signal a little and squashes it through a nonlinearity. Even with LSTM’s “highway” cell state, it’s still a single lane in a single direction.

Time Series Forecasting (3): GRU — Lightweight Gates and Efficiency Trade-offs

Tue, 01 Oct 2024 09:00:00 +0000

After you’ve used LSTM for a while, an obvious question shows up: aren’t three gates a bit much? The forget and input gates seem to do related work — one decides what to drop, the other decides what to add — couldn’t they be merged? And does the cell state really need to be a separate vector from the hidden state, or could the hidden state do double duty?

That is exactly the question Cho et al. answered in 2014 with the Gated Recurrent Unit. They collapsed three gates into two: an update gate that controls how much of the old state to keep versus how much new content to absorb, and a reset gate that decides whether to ignore the old state entirely when computing a fresh candidate. The cell state is folded back into the hidden state. The result is roughly 25% fewer parameters, training that runs 10-15% faster, and accuracy on most time-series tasks that is statistically indistinguishable from LSTM.
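The 25% figure follows directly from counting weight blocks: an LSTM layer has four (input, forget and output gates plus the candidate), a GRU layer has three (update gate, reset gate, candidate). The sketch below counts parameters under the textbook convention of one bias vector per block; the function names are hypothetical, and some libraries store two bias vectors per block, which changes the totals but not the ratio.

```python
def gated_layer_params(input_size, hidden_size, n_blocks):
    """Each block has W (hidden x input), U (hidden x hidden) and one bias vector."""
    per_block = hidden_size * (input_size + hidden_size) + hidden_size
    return n_blocks * per_block


def lstm_params(input_size, hidden_size):
    return gated_layer_params(input_size, hidden_size, n_blocks=4)


def gru_params(input_size, hidden_size):
    return gated_layer_params(input_size, hidden_size, n_blocks=3)


lstm, gru = lstm_params(10, 20), gru_params(10, 20)
print(lstm, gru, f"GRU saves {1 - gru / lstm:.0%}")
```

The saving is the same for any input and hidden size, because both layers use identically shaped blocks and differ only in how many they have.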

Time Series Forecasting (2): LSTM — Gate Mechanisms and Long-Term Dependencies

Mon, 16 Sep 2024 09:00:00 +0000

The first RNN I ever trained, back in 2017, was a small sales forecaster: 50 days in, the next day out. The forward pass ran cleanly, the loss went down, and yet the model had near-total amnesia about anything older than three days. The data had a clear monthly cycle. The model couldn’t see it. I assumed I needed more data, so I added rows and layers — and watched the training loss jump to NaN halfway through epoch two.

Time Series Forecasting (1): Traditional Statistical Models

Sun, 01 Sep 2024 09:00:00 +0000

The first time I touched data that “looked like a time series” — hourly server CPU usage — my instinct was to throw it at a linear regression. Time on the x-axis, usage on the y-axis. The fit was terrible. The problem wasn’t the regression; the problem was that this kind of data has its own personality. It has trends, seasonality, and a stubborn dependence between consecutive observations. A vanilla regression treats every row as an independent sample and throws away the one piece of information that matters most: time itself.
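The "stubborn dependence between consecutive observations" can be measured with the sample autocorrelation: correlate the series with a copy of itself shifted by a few steps. A minimal sketch, using the standard biased estimator and made-up toy data (the function name and the example series are illustrative, not from the post):

```python
def autocorrelation(series, lag):
    """Sample autocorrelation at the given lag (standard biased estimator)."""
    n = len(series)
    mean = sum(series) / n
    dev = [v - mean for v in series]
    denom = sum(d * d for d in dev)
    num = sum(dev[t] * dev[t - lag] for t in range(lag, n))
    return num / denom


trending = [1, 2, 3, 4, 5]        # each value resembles its neighbour
alternating = [1, -1, 1, -1]      # each value is the opposite of its neighbour
print(autocorrelation(trending, 1))
print(autocorrelation(alternating, 1))
```

Independent samples would give values near zero at every lag. Values far from zero are exactly the structure a row-by-row regression ignores.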