LLM Engineering (6): Long Context — RoPE, YaRN, Sinks

Wed, 01 Apr 2026 09:00:00 +0000

“1M token context” is one of the most over-claimed numbers in LLMs. Saying a model can attend to 1M tokens is an architecture statement. Saying a model can use information at position 800K to answer a question is a behavior statement, and it is much harder to satisfy. This chapter covers the math of position encoding, the engineering tricks that extend context beyond the training length, and why most long-context claims fail needle-in-a-haystack tests.

NLP (9): Deep Dive into LLM Architecture

Mon, 10 Nov 2025 09:00:00 +0000

The 2017 Transformer paper drew one block. Every production LLM today still uses that diagram as a silhouette, but almost every internal piece has been replaced. Pre-norm replaced post-norm. RMSNorm replaced LayerNorm. SwiGLU replaced GELU. Rotary embeddings replaced sinusoids. Multi-head attention became grouped-query attention. The dense FFN sometimes became a sparse mixture of experts. And the inference loop is dominated by a data structure that doesn’t appear in the original paper at all: the KV cache.

RoPE on Chen Kai Blog
