Transformer

Mar 27, 2026 LLM Engineering 56 min read

LLM Engineering (1): Architectures from Transformer to MoE

MHA → GQA → MQA, sparse MoE routing in Mixtral and Qwen3-MoE, sliding-window attention, and the recurrent alternatives Mamba (a state-space model) and RWKV (a linear-attention RNN): what each costs and where each wins.

Nov 10, 2025 NLP 32 min read

NLP (9): Deep Dive into LLM Architecture

Inside modern LLMs: pre-norm + RMSNorm + SwiGLU + RoPE + GQA, KV cache mechanics, FlashAttention's IO-aware schedule, sparse Mixture-of-Experts, and INT8 / INT4 quantization.

Oct 16, 2025 NLP 34 min read

NLP (4): Attention Mechanism and Transformer

From the bottleneck of Seq2Seq to Attention Is All You Need. Build intuition for scaled dot-product attention, multi-head attention, positional encoding, masking, and assemble a complete Transformer in PyTorch.

Apr 16, 2025 Linear Algebra 34 min read

Essence of Linear Algebra (16): Linear Algebra in Deep Learning

Deep learning is large-scale matrix computation. From backpropagation as the chain rule in matrix form, to im2col turning convolutions into GEMM, to attention as soft retrieval via dot products -- see every core DL …

Dec 15, 2024 Time Series Forecasting 36 min read

Time Series Forecasting (8): Informer — Efficient Long-Sequence Forecasting

Informer cuts self-attention cost from O(L^2) to O(L log L) with ProbSparse attention, then adds self-attention distilling and a one-shot generative decoder for long horizons. Full math, PyTorch code, and ETT/weather benchmarks.

Oct 31, 2024 Time Series Forecasting 28 min read

Time Series Forecasting (5): Transformer Architecture for Time Series

Transformers for time series, end to end: encoder-decoder anatomy, temporal positional encoding, the O(n^2) attention bottleneck, decoder-only forecasting, and patching. With variants (Autoformer, FEDformer, Informer, …

Jun 30, 2023 Standalone 12 min read

Position Encoding Brief: From Sinusoidal to RoPE and ALiBi

A practitioner's tour of Transformer position encoding: why attention needs it at all, how sinusoidal/learned/relative/RoPE/ALiBi schemes differ, and which one to pick when long-context extrapolation matters.