MoE

Mar 27, 2026 LLM Engineering 56 min read

LLM Engineering (1): Architectures from Transformer to MoE

MHA → GQA → MQA, sparse MoE routing in Mixtral and Qwen3-MoE, sliding-window attention, and the state-space alternatives Mamba and RWKV — what each costs and where each wins.

Nov 10, 2025 NLP 32 min read

NLP (9): Deep Dive into LLM Architecture

Inside modern LLMs: pre-norm + RMSNorm + SwiGLU + RoPE + GQA, KV cache mechanics, FlashAttention's IO-aware schedule, sparse Mixture-of-Experts, and INT8 / INT4 quantization.