Attention on Chen Kai Blog

NLP (4): Attention Mechanism and Transformer

Thu, 16 Oct 2025 09:00:00 +0000

In June 2017, eight researchers, most of them at Google Brain and Google Research, published a paper with a deliberately bold title: Attention Is All You Need. The architecture it introduced, the Transformer, threw away recurrence and convolution entirely. There were no LSTMs, no GRUs, no left-to-right scanning of a sentence. Instead, every token in a sequence could look at every other token directly through a single mathematical operation: scaled dot-product attention.

That one design decision let training run in parallel across every position in a sequence on GPUs, sidestepped the long-range dependency problems that had plagued RNNs for decades, and became the substrate on which BERT, GPT, T5, LLaMA, Claude, and essentially every modern large language model is built. If you understand this article well, the rest of the series is mostly variations on a theme.
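As a rough illustration of the scaled dot-product attention named above, here is a minimal sketch in plain Python. It assumes a single head, no masking, and no learned projection matrices; the function names and the toy vectors are made up for this example and are not taken from the paper's code.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for lists of row vectors.

    Q, K, V are lists of equal-length float lists. Returns the output
    rows and the attention weights (one weight row per query).
    """
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K
        ]
        w = softmax(scores)
        weights.append(w)
        # Each output row is a weighted average of the value rows.
        outputs.append(
            [sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))]
        )
    return outputs, weights


# Toy self-attention: 3 tokens with 2-dimensional vectors (made-up values).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = scaled_dot_product_attention(X, X, X)
```

Each row of `weights` sums to 1, and every token's weights cover all three tokens at once, which is the "look at every other token directly" property. A real Transformer layer would add learned query, key, and value projections and several heads on top of this.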

Time Series Forecasting (4): Attention Mechanisms — Direct Long-Range Dependencies

Wed, 16 Oct 2024 09:00:00 +0000

RNNs and LSTMs handled “too many time steps” but left a subtler limitation in place: information has to travel step by step. For step 100 to see what happened at step 1, the signal has to ride the hidden state through 99 intermediate stops, and each stop attenuates it a little and squashes it through a nonlinearity. Even with LSTM’s “highway” cell state, it’s still a single lane in a single direction.
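To put a rough number on that attenuation, the sketch below uses a deliberately crude toy model: it assumes every stop multiplies the signal by the same constant gain below 1. Real recurrent networks are not this simple, and the gain values here are invented purely for illustration.

```python
def surviving_signal(per_step_gain, num_steps):
    """Fraction of the original signal left after num_steps multiplications.

    Toy assumption: each step scales the signal by the same constant gain.
    """
    return per_step_gain ** num_steps


# Step 1 to step 100 means 99 intermediate stops.
hops = 99
for gain in (0.99, 0.95, 0.90):
    remaining = surviving_signal(gain, hops)
    print(f"gain={gain:.2f}  fraction left after {hops} hops: {remaining:.2e}")
```

Even a gain close to 1 compounds over 99 multiplications, whereas an attention layer connects step 100 to step 1 in a single hop.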

Graph Contextualized Self-Attention Network (GC-SAN) for Session-based Recommendation

Sun, 29 Jan 2023 09:00:00 +0000

In session-based recommendation you only see a short anonymous click sequence: no user profile, no long history, no demographics. Every signal you have lives inside that single window. GC-SAN (IJCAI 2019) takes two of the strongest ideas of the time, SR-GNN’s session graph and the Transformer’s self-attention, and stacks them: a graph view captures local transition patterns and loops, a sequence view captures long-range intent, and a small weighted sum decides how much of each to trust. The result is a clean “best of both worlds” baseline that is hard to beat at its parameter budget.
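The weighted sum that blends the two views can be sketched as a convex combination of two vectors. This is an illustrative reading of the description above, not the authors' code: the function name, the `omega` parameter name, and the example vectors are all assumptions made for this sketch.

```python
def fuse_session_representation(attention_last, graph_last, omega):
    """Blend the sequence view and the graph view of a session.

    attention_last: self-attention output at the last position (sequence view)
    graph_last:     graph-based embedding of the last click (graph view)
    omega:          weight in [0, 1] given to the sequence view
    """
    if not 0.0 <= omega <= 1.0:
        raise ValueError("omega must lie in [0, 1]")
    if len(attention_last) != len(graph_last):
        raise ValueError("both views must have the same dimension")
    return [
        omega * a + (1.0 - omega) * g
        for a, g in zip(attention_last, graph_last)
    ]


# Made-up 2-dimensional vectors, trusting both views equally.
session_vector = fuse_session_representation([0.2, 0.8], [0.6, 0.4], omega=0.5)
```

With `omega=1.0` the result is the sequence view alone, and with `omega=0.0` it is the graph view alone, so one scalar controls how much of each to trust.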