Transformer on Chen Kai Blog

LLM Engineering (1): Architectures from Transformer to MoE

Fri, 27 Mar 2026 09:00:00 +0000

The 2017 Transformer block is still the silhouette of every production LLM in 2026, but almost every internal piece has been swapped, sparsified, or specialized. This series covers the modern stack end to end — architecture, training, inference, retrieval, evaluation, safety, deployment. Chapter 1 is about the block itself: what attention looks like in a 2026 model, how MoE breaks the param-FLOPs link, and where the non-attention alternatives (Mamba, RWKV) actually beat the Transformer.

NLP (9): Deep Dive into LLM Architecture

Mon, 10 Nov 2025 09:00:00 +0000

The 2017 Transformer paper drew one block. Every production LLM today still uses that diagram as a silhouette, but almost every internal piece has been replaced. Pre-norm replaced post-norm. RMSNorm replaced LayerNorm. SwiGLU replaced GELU. Rotary embeddings replaced sinusoids. Multi-head attention became grouped-query attention. The dense FFN sometimes became a sparse mixture of experts. And the inference loop is dominated by a data structure that doesn’t appear in the original paper at all: the KV cache.

NLP (4): Attention Mechanism and Transformer

Thu, 16 Oct 2025 09:00:00 +0000

In June 2017, eight researchers at Google Brain and Google Research published a paper with a deliberately bold title: Attention Is All You Need. The architecture it introduced, the Transformer, threw away recurrence entirely. There were no LSTMs, no GRUs, no left-to-right scanning of a sentence. Instead, every token in a sequence could look at every other token directly through a single mathematical operation: scaled dot-product attention.

That one design decision unlocked massive parallelism on GPUs, eliminated the long-range dependency problems that had plagued RNNs for decades, and became the substrate on which BERT, GPT, T5, LLaMA, Claude, and essentially every modern large language model is built. If you understand this article well, the rest of the series is mostly variations on a theme.

Essence of Linear Algebra (16): Linear Algebra in Deep Learning

Wed, 16 Apr 2025 09:00:00 +0000

Strip away the marketing and a deep network is one thing: a long pipeline of matrix multiplications glued together by elementwise nonlinearities. Forward pass, backward pass, convolution, attention, normalization, fine-tuning — every “trick” is a small twist on the same algebraic theme. Once you see the matrices, the field stops looking like a bag of recipes and starts looking like a single language.

This chapter rebuilds the modern stack from that single language. We follow one signal — a vector $\mathbf{x}$ — as it flows through linear layers, gets convolved, gets attended to, gets normalized, and gets adapted by a low-rank update. At each step we name the matrix that does the work and the property of that matrix (rank, conditioning, transpose) that makes the trick succeed.

Time Series Forecasting (8): Informer — Efficient Long-Sequence Forecasting

Sun, 15 Dec 2024 09:00:00 +0000

The Transformer is wonderful at sequence modeling — right up to the moment your sequence gets long. Vanilla self-attention costs $\mathcal{O}(L^2)$ in both compute and memory, so a one-week hourly window (168 steps) is fine, a one-month window (720 steps) is painful, and a three-month window (2160 steps) is essentially impossible on a single GPU. That is exactly the regime real-world long-horizon forecasting lives in: weather, energy, finance, IoT.

Time Series Forecasting (5): Transformer Architecture for Time Series

Thu, 31 Oct 2024 09:00:00 +0000

The 2017 Attention Is All You Need paper took the attention mechanism from the previous chapter to its logical extreme: drop the RNN entirely. Transformers stack pure attention into a full sequence model — no recurrence, no hidden state propagating over time. Originally designed for machine translation, the architecture was quickly adapted to every other sequence task, time series included.

Dropping a vanilla NLP Transformer onto a time-series problem runs into two immediate complications. The first is position. Attention is a set operation — shuffle the input order and the output is unchanged. For a time series, order is everything: a temperature curve that goes up-then-down and one that goes down-then-up are entirely different signals. NLP solves this with sinusoidal position encodings; do those still make sense for time series, or should we use learned encodings, or just concatenate calendar features (hour-of-day, day-of-week) directly into the input?

Position Encoding Brief: From Sinusoidal to RoPE and ALiBi

Fri, 30 Jun 2023 09:00:00 +0000

Self-attention has a strange property that surprises most people the first time they compute it by hand: it does not know the order of its inputs. Permute the tokens and every attention score is permuted along with them — the function is exactly equivariant. So before we can do anything useful with a Transformer, we have to inject position information from the outside.

That single design decision — how to inject it — has spawned a remarkable amount of research. Sinusoidal, learned, relative, T5-style buckets, RoPE, ALiBi, NoPE, and more. This post is a practitioner’s brief: enough math to know why each scheme works, enough comparison to choose one, and a clear focus on the property that matters most in the LLM era — length extrapolation, the ability to handle sequences longer than anything seen in training.