<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Time Series on Chen Kai Blog</title><link>https://www.chenk.top/en/time-series/</link><description>Recent content in Time Series on Chen Kai Blog</description><generator>Hugo</generator><language>en</language><lastBuildDate>Sun, 15 Dec 2024 09:00:00 +0000</lastBuildDate><atom:link href="https://www.chenk.top/en/time-series/index.xml" rel="self" type="application/rss+xml"/><item><title>Time Series Forecasting (8): Informer -- Efficient Long-Sequence Forecasting</title><link>https://www.chenk.top/en/time-series/informer-long-sequence/</link><pubDate>Sun, 15 Dec 2024 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/time-series/informer-long-sequence/</guid><description>&lt;p>The Transformer is wonderful at sequence modeling &amp;ndash; right up to the moment your sequence gets long. Vanilla self-attention costs $\mathcal{O}(L^2)$ in both compute and memory, so a one-week hourly window (168 steps) is fine, a one-month window (720 steps) is painful, and a three-month window (2160 steps) is essentially impossible on a single GPU. That is exactly the regime real-world long-horizon forecasting lives in: weather, energy, finance, IoT.&lt;/p>
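The quadratic blow-up is easy to see with a back-of-envelope calculation (a rough illustration, not from this post; real usage multiplies these numbers by batch size, heads, and layers):

```python
# Memory for one float32 attention matrix at each window length.
def attn_matrix_mib(seq_len, bytes_per_float=4):
    # seq_len x seq_len scores, float32, converted to MiB
    return seq_len * seq_len * bytes_per_float / 2**20

for steps in (168, 720, 2160):
    print(f"{steps:5d} steps: {attn_matrix_mib(steps):8.1f} MiB per head")
```

Tripling the window from 720 to 2160 steps multiplies the attention matrix by nine, which is why long-horizon windows hit the wall so quickly.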
&lt;p>&lt;strong>Informer&lt;/strong> (Zhou et al., AAAI 2021 best paper) is the architecture that finally made Transformers practical for these settings. It does three things, each of which would be a contribution on its own:&lt;/p></description></item><item><title>Time Series Forecasting (7): N-BEATS -- Interpretable Deep Architecture</title><link>https://www.chenk.top/en/time-series/n-beats/</link><pubDate>Sat, 30 Nov 2024 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/time-series/n-beats/</guid><description>&lt;p>The 2018 M4 forecasting competition served 100,000 series across six frequencies as a single benchmark. The leaderboard was dominated by hand-tuned ensembles built from decades of statistical-forecasting craft. Soon after, a &lt;strong>pure neural network&lt;/strong> with no statistical preprocessing, no feature engineering, and no recurrence beat the competition&amp;rsquo;s winning hybrid ensemble on that same benchmark. That network was &lt;strong>N-BEATS&lt;/strong> by Oreshkin et al. &amp;ndash; a stack of fully-connected blocks with two residual paths. Its interpretable variant additionally split the forecast into a polynomial trend and a Fourier seasonality, so the very thing classical statisticians wanted (a readable decomposition) came for free.&lt;/p></description></item><item><title>Time Series Forecasting (6): Temporal Convolutional Networks (TCN)</title><link>https://www.chenk.top/en/time-series/temporal-convolutional-networks/</link><pubDate>Fri, 15 Nov 2024 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/time-series/temporal-convolutional-networks/</guid><description>&lt;p>For most of the 2010s, anyone who said &amp;ldquo;deep learning for time series&amp;rdquo; meant LSTM. The story changed in 2018 when Bai, Kolter, and Koltun published &lt;em>An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling&lt;/em>. 
Their result was annoyingly simple: take a stack of 1-D convolutions, make them causal (no peeking at the future), space the filter taps out exponentially (dilation), wrap the whole thing in residual connections, and train. On task after task, the resulting &lt;strong>Temporal Convolutional Network&lt;/strong> (TCN) matched or beat LSTM/GRU &amp;ndash; while training several times faster because every time step in the forward pass runs in parallel.&lt;/p></description></item><item><title>Time Series Forecasting (5): Transformer Architecture for Time Series</title><link>https://www.chenk.top/en/time-series/transformer/</link><pubDate>Thu, 31 Oct 2024 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/time-series/transformer/</guid><description>&lt;h2 id="what-you-will-learn">What You Will Learn&lt;/h2>
&lt;ul>
&lt;li>The full encoder-decoder Transformer, redrawn for time series&lt;/li>
&lt;li>Why position must be injected, and how sinusoidal / learned / time-aware encodings differ&lt;/li>
&lt;li>What multi-head attention actually learns over a temporal sequence&lt;/li>
&lt;li>Where vanilla attention breaks down ($O(n^2)$ cost) and the four families of fixes: sparse, linear, patched, decoder-only&lt;/li>
&lt;li>A clean PyTorch reference implementation, plus when to reach for Autoformer / FEDformer / Informer / PatchTST&lt;/li>
&lt;/ul>
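The sinusoidal encodings mentioned above can be sketched in a few lines (the standard Vaswani et al. formulation in plain Python; the post's reference implementation may differ in detail):

```python
import math

# Sinusoidal positional encoding: even dimensions get sin, odd get cos,
# with wavelengths forming a geometric progression from 2*pi to 10000*2*pi.
def sinusoidal_encoding(num_positions, d_model):
    pe = [[0.0] * d_model for _ in range(num_positions)]
    for pos in range(num_positions):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            pe[pos][i + 1] = math.cos(angle)
    return pe

pe = sinusoidal_encoding(168, 64)  # one week of hourly steps, d_model = 64
```

Because each position is a fixed function of its index, the model can attend to relative offsets without any learned parameters, which is what makes this the default choice.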
&lt;h2 id="prerequisites">Prerequisites&lt;/h2>
&lt;ul>
&lt;li>Self-attention and multi-head attention (Part 4)&lt;/li>
&lt;li>Encoder-decoder architectures and teacher forcing&lt;/li>
&lt;li>PyTorch fundamentals (&lt;code>nn.Module&lt;/code>, training loops)&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="1-why-transformers-for-time-series">1. Why Transformers for Time Series&lt;/h2>
&lt;p>LSTM and GRU process a sequence step by step. Three things follow from
that:&lt;/p></description></item><item><title>Time Series Forecasting (4): Attention Mechanisms -- Direct Long-Range Dependencies</title><link>https://www.chenk.top/en/time-series/attention-mechanism/</link><pubDate>Wed, 16 Oct 2024 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/time-series/attention-mechanism/</guid><description>&lt;h2 id="what-you-will-learn">What you will learn&lt;/h2>
&lt;ul>
&lt;li>Why recurrent models hit a wall on long-range dependencies, and how attention removes it.&lt;/li>
&lt;li>The Query / Key / Value mechanism, scaled dot-product attention, and the role of $1/\sqrt{d_k}$.&lt;/li>
&lt;li>Two classic scoring functions &amp;ndash; &lt;strong>Bahdanau&lt;/strong> (additive) and &lt;strong>Luong&lt;/strong> (multiplicative).&lt;/li>
&lt;li>How to wire &lt;strong>attention into an LSTM encoder/decoder&lt;/strong> for time series.&lt;/li>
&lt;li>&lt;strong>Multi-head attention&lt;/strong> specialised for temporal data &amp;ndash; different heads for recency, periodicity, and anomalies.&lt;/li>
&lt;li>The $O(n^2)$ memory wall and how sparse / linear attention bypass it.&lt;/li>
&lt;li>A worked &lt;strong>stock-prediction case&lt;/strong> with attention-weight overlays.&lt;/li>
&lt;/ul>
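As a minimal illustration of the Q/K/V mechanics and the $1/\sqrt{d_k}$ scaling listed above (a NumPy sketch for a single head with no batching, not this post's reference implementation):

```python
import numpy as np

# Scaled dot-product attention: similarity scores, scaled, softmaxed
# into weights, then used to mix the values.
def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n_q, n_k) similarities
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((6, 4)) for _ in range(3))
out, w = scaled_dot_product_attention(Q, K, V)   # out: (6, 4), w rows sum to 1
```

The weight matrix `w` is exactly what the attention-weight overlays in the stock-prediction case visualise: one row per query step, showing which past steps it reads from.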
&lt;p>&lt;strong>Prerequisites&lt;/strong>: RNN/LSTM/GRU intuition (Parts 2-3), basic linear algebra, PyTorch.&lt;/p></description></item><item><title>Time Series Forecasting (3): GRU -- Lightweight Gates and Efficiency Trade-offs</title><link>https://www.chenk.top/en/time-series/gru/</link><pubDate>Tue, 01 Oct 2024 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/time-series/gru/</guid><description>&lt;h2 id="what-you-will-learn">What You Will Learn&lt;/h2>
&lt;ul>
&lt;li>How GRU&amp;rsquo;s &lt;strong>update gate&lt;/strong> $z_t$ and &lt;strong>reset gate&lt;/strong> $r_t$ achieve LSTM-quality memory with one fewer gate and one fewer state.&lt;/li>
&lt;li>Why GRU has exactly &lt;strong>25% fewer parameters&lt;/strong> than LSTM, and what that buys you in practice.&lt;/li>
&lt;li>How to read GRU &lt;strong>gate activations&lt;/strong> to debug what the model is paying attention to.&lt;/li>
&lt;li>A practical &lt;strong>decision matrix&lt;/strong> for picking GRU vs LSTM, backed by parameter, speed, and forecast-quality benchmarks.&lt;/li>
&lt;li>A clean PyTorch reference implementation with the regularisation and stability tricks that actually matter.&lt;/li>
&lt;/ul>
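The 25% figure follows directly from the gate counts: both cells pay the same price per gate-style block, and GRU has three of them where LSTM has four. A quick sanity check (counting one bias vector per gate; frameworks that keep separate input and hidden biases change the totals but not the ratio):

```python
# Parameters per recurrent layer: each gate block holds a (d_h x (d_in + d_h))
# weight matrix plus a d_h bias. LSTM has 4 such blocks, GRU has 3.
def lstm_params(d_in, d_h):
    return 4 * (d_h * (d_in + d_h) + d_h)

def gru_params(d_in, d_h):
    return 3 * (d_h * (d_in + d_h) + d_h)

for d_in, d_h in [(1, 64), (8, 128)]:
    saving = 1 - gru_params(d_in, d_h) / lstm_params(d_in, d_h)
    print(d_in, d_h, f"{saving:.0%}")  # 25% regardless of sizes
```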
&lt;h2 id="prerequisites">Prerequisites&lt;/h2>
&lt;ul>
&lt;li>Comfort with the LSTM gates from &lt;a href="https://www.chenk.top/en/time-series-lstm/">Part 2&lt;/a>.&lt;/li>
&lt;li>Basic PyTorch (&lt;code>nn.Module&lt;/code>, autograd, optimizers).&lt;/li>
&lt;li>Recall that gradient flow through tanh nonlinearities is what kills vanilla RNNs.&lt;/li>
&lt;/ul>
&lt;hr>
&lt;p>&lt;figure>
 &lt;img src="https://blog-pic-ck.oss-cn-beijing.aliyuncs.com/posts/en/time-series/gru/fig1_gru_cell_architecture.png" alt="GRU cell with reset and update gates and the (1 - z) gradient highway from h_{t-1} to h_t." loading="lazy" decoding="async">
 
&lt;/figure>

&lt;em>Figure 1. The GRU cell. Two gates (&lt;code>r&lt;/code>, &lt;code>z&lt;/code>) and one state (&lt;code>h&lt;/code>) replace LSTM&amp;rsquo;s three gates and separate cell state. The orange &lt;code>(1 - z) ⊙ h_{t-1}&lt;/code> skip path is the linear gradient highway that makes long-range learning tractable.&lt;/em>&lt;/p></description></item><item><title>Time Series Forecasting (2): LSTM -- Gate Mechanisms and Long-Term Dependencies</title><link>https://www.chenk.top/en/time-series/lstm/</link><pubDate>Mon, 16 Sep 2024 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/time-series/lstm/</guid><description>&lt;h2 id="what-you-will-learn">What You Will Learn&lt;/h2>
&lt;ul>
&lt;li>Why vanilla RNNs fail on long sequences and how LSTM fixes the gradient problem&lt;/li>
&lt;li>The intuition behind each gate (forget, input, output) and the cell-state &amp;ldquo;highway&amp;rdquo;&lt;/li>
&lt;li>How to structure inputs/outputs for one-step and multi-step time series forecasting&lt;/li>
&lt;li>Practical recipes: regularization, sequence length, bidirectional vs stacked LSTM, when to choose LSTM vs GRU&lt;/li>
&lt;/ul>
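The input/output structuring mentioned above boils down to a sliding window: each training sample is a fixed-length history and the value that follows it. A generic pure-Python sketch (names here are illustrative, not from the post):

```python
# Turn a univariate series into (window, target) pairs for one-step
# forecasting; each X[i] is the history, y[i] the next observation.
def make_windows(series, window):
    X, y = [], []
    for i in range(len(series) - window):
        X.append(series[i : i + window])
        y.append(series[i + window])
    return X, y

X, y = make_windows(list(range(10)), window=3)
print(len(X), X[0], y[0])  # 7 [0, 1, 2] 3
```

For multi-step forecasting the target becomes a slice of the next $h$ values instead of a single point; the windowing logic is otherwise identical.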
&lt;h2 id="prerequisites">Prerequisites&lt;/h2>
&lt;ul>
&lt;li>Basic understanding of neural networks (forward pass, backpropagation)&lt;/li>
&lt;li>Familiarity with PyTorch (&lt;code>nn.Module&lt;/code>, tensors, optimizers)&lt;/li>
&lt;li>Part 1 of this series (helpful but not required)&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="1-the-problem-lstm-solves">1. The Problem LSTM Solves&lt;/h2>
&lt;p>A vanilla RNN updates its hidden state as&lt;/p>
$$h_t = \tanh(W_h h_{t-1} + W_x x_t + b),$$
&lt;p>so backpropagating from step $T$ to step $k$ multiplies the Jacobians&lt;/p>
$$\frac{\partial h_T}{\partial h_k} = \prod_{t=k+1}^{T} \mathrm{diag}\!\left(1 - h_t^2\right) W_h.$$
&lt;p>Two regimes appear:&lt;/p></description></item><item><title>Time Series Forecasting (1): Traditional Statistical Models</title><link>https://www.chenk.top/en/time-series/01-traditional-models/</link><pubDate>Sun, 01 Sep 2024 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/time-series/01-traditional-models/</guid><description>&lt;blockquote>
&lt;p>&lt;a href="https://www.chenk.top/en/time-series-lstm/">Next: LSTM Deep Dive &amp;ndash;&amp;gt;&lt;/a>
&lt;/p>
&lt;/blockquote>
&lt;h2 id="what-you-will-learn">What You Will Learn&lt;/h2>
&lt;ul>
&lt;li>Why &lt;strong>stationarity&lt;/strong> is the entry ticket for the whole ARIMA family, and how differencing buys it.&lt;/li>
&lt;li>How to read &lt;strong>ACF and PACF&lt;/strong> plots like a Box-Jenkins practitioner: cut-off vs. tail-off as the rule for identifying $p$ and $q$.&lt;/li>
&lt;li>The full &lt;strong>ARIMA / SARIMA&lt;/strong> machinery, including how seasonality is folded in via lag-$s$ operators.&lt;/li>
&lt;li>Where &lt;strong>VAR, GARCH, exponential smoothing, Prophet and the Kalman filter&lt;/strong> sit on the same map &amp;ndash; mean dynamics vs. variance dynamics vs. state-space recursion.&lt;/li>
&lt;li>A decision rule for when a traditional model is the right answer and when to graduate to the deep models in the rest of this series.&lt;/li>
&lt;/ul>
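Differencing, the trick that "buys" stationarity above, is simple enough to show inline (a pure-Python sketch; in practice you would reach for pandas `.diff()` or statsmodels):

```python
# First differencing: y'_t = y_t - y_{t-1}. One pass removes a linear
# trend, leaving a constant series that is trivially stationary.
def difference(series, lag=1):
    return [series[i] - series[i - lag] for i in range(lag, len(series))]

trend = [2 * t + 5 for t in range(8)]  # deterministic linear trend
print(difference(trend))  # [2, 2, 2, 2, 2, 2, 2]
```

This is exactly the "d" in ARIMA(p, d, q): difference d times until the ACF stops tailing off slowly, then fit the AR and MA parts to what remains.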
&lt;h2 id="prerequisites">Prerequisites&lt;/h2>
&lt;ul>
&lt;li>Basic probability and statistics (mean, variance, covariance, correlation).&lt;/li>
&lt;li>Familiarity with NumPy and &lt;code>pandas&lt;/code> time indexes.&lt;/li>
&lt;li>A little linear algebra for the VAR / Kalman sections (matrix multiplication, eigenvalues).&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="1-why-traditional-models-still-matter">1. Why traditional models still matter&lt;/h2>
&lt;p>Before the deep-learning era, the time-series toolbox was already remarkably complete. ARIMA captures linear autocorrelation, SARIMA adds calendar effects, VAR generalises to vectors, GARCH models the variance, and the Kalman filter unifies the lot inside a state-space recursion. They share three properties that deep models do not give for free:&lt;/p></description></item></channel></rss>