Mamba

Mar 27, 2026 · LLM Engineering · 56 min read

LLM Engineering (1): Architectures from Transformer to MoE

MHA → GQA → MQA, sparse MoE routing in Mixtral and Qwen3-MoE, sliding-window attention, and the recurrent alternatives to attention, Mamba (a selective state-space model) and RWKV (a linear-attention RNN): what each costs and where each wins.