<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Attention on Chen Kai Blog</title><link>https://www.chenk.top/en/tags/attention/</link><description>Recent content in Attention on Chen Kai Blog</description><generator>Hugo</generator><language>en</language><lastBuildDate>Thu, 16 Oct 2025 09:00:00 +0000</lastBuildDate><atom:link href="https://www.chenk.top/en/tags/attention/index.xml" rel="self" type="application/rss+xml"/><item><title>NLP (4): Attention Mechanism and Transformer</title><link>https://www.chenk.top/en/nlp/attention-transformer/</link><pubDate>Thu, 16 Oct 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/nlp/attention-transformer/</guid><description>&lt;p>In June 2017, eight researchers at Google Brain and Google Research published a paper with a deliberately bold title: &lt;em>Attention Is All You Need&lt;/em>. The architecture it introduced, the &lt;strong>Transformer&lt;/strong>, threw away recurrence entirely. There were no LSTMs, no GRUs, no left-to-right scanning of a sentence. Instead, every token in a sequence could look at every other token directly through a single mathematical operation: scaled dot-product attention.&lt;/p>
&lt;p>That one design decision unlocked massive parallelism on GPUs, eliminated the long-range dependency problems that had plagued RNNs for decades, and became the substrate on which BERT, GPT, T5, LLaMA, Claude, and essentially every modern large language model is built. If you understand this article well, the rest of the series is mostly variations on a theme.&lt;/p></description></item><item><title>Time Series Forecasting (4): Attention Mechanisms — Direct Long-Range Dependencies</title><link>https://www.chenk.top/en/time-series/attention-mechanism/</link><pubDate>Wed, 16 Oct 2024 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/time-series/attention-mechanism/</guid><description>&lt;p>RNNs and LSTMs handled &amp;ldquo;too many time steps&amp;rdquo; but left a subtler limitation in place: information has to travel &lt;strong>step by step&lt;/strong>. For step 100 to see what happened at step 1, the signal has to ride the hidden state through 99 intermediate stops — and each stop attenuates the signal a little and squashes it through a nonlinearity. Even with LSTM&amp;rsquo;s &amp;ldquo;highway&amp;rdquo; cell state, it&amp;rsquo;s still a single lane in a single direction.&lt;/p></description></item><item><title>Graph Contextualized Self-Attention Network (GC-SAN) for Session-based Recommendation</title><link>https://www.chenk.top/en/standalone/gcsan/</link><pubDate>Sun, 29 Jan 2023 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/standalone/gcsan/</guid><description>&lt;p>In session-based recommendation you only see a short anonymous click sequence — no user profile, no long history, no demographics. Every signal you have lives inside that single window. 
&lt;strong>GC-SAN&lt;/strong> (IJCAI 2019) takes the two strongest ideas of the time, SR-GNN&amp;rsquo;s session graph and the Transformer&amp;rsquo;s self-attention, and stacks them: a &lt;em>graph&lt;/em> view captures local transition patterns and loops, a &lt;em>sequence&lt;/em> view captures long-range intent, and a tiny weighted sum decides how much of each to trust. The result is a clean &amp;ldquo;best of both worlds&amp;rdquo; baseline that is genuinely hard to beat at its parameter budget.&lt;/p></description></item></channel></rss>