LLM Engineering

The modern LLM stack, end to end: architectures, post-training, inference, RAG, evaluation, safety, and production.

12 articles

01
LLM Engineering (1): Architectures from Transformer to MoE
MHA → GQA → MQA, sparse MoE routing in Mixtral and Qwen3-MoE, sliding-window attention, and the state-space alternatives …
2026-03-27 56 min
02
LLM Engineering (2): Tokenization Deep Dive
BPE vs SentencePiece vs WordPiece, byte-level fallback, the CJK token-bloat problem, vocabulary expansion costs, and the …
2026-03-28 18 min
03
LLM Engineering (3): Pretraining at Scale
Data mixing, deduplication, contamination, μP, FSDP vs ZeRO-3 vs pipeline parallel, the practical 200B-token cliff, and …
2026-03-29 42 min
04
LLM Engineering (4): Post-training — SFT, DPO, RLHF, RLAIF
What SFT, DPO, RLHF, and RLAIF each actually optimize, when reward models fail, KL constraints, the LoRA-vs-full-FT …
2026-03-30 52 min
05
LLM Engineering (5): Inference Optimization
KV cache mechanics, paged attention, continuous batching, speculative decoding, INT8/INT4/AWQ/GPTQ quantization, and the …
2026-03-31 42 min
06
LLM Engineering (6): Long Context — RoPE, YaRN, Sinks
How RoPE encodes position, why naive extension breaks, NTK-aware and YaRN scaling, ALiBi vs RoPE, attention sinks for …
2026-04-01 34 min
07
LLM Engineering (7): Function Calling and Tool Use
JSON-mode vs function-mode vs free-form, parallel tool calls, structured-output guarantees with grammars, error recovery …
2026-04-02 34 min
08
LLM Engineering (8): Retrieval-Augmented Generation
Chunking strategies, dense vs sparse vs hybrid retrieval, reranker selection, the long-context-vs-RAG tradeoff in 2026, …
2026-04-03 34 min
09
LLM Engineering (9): Prompting at Production Scale
Chain-of-thought when it actually helps, self-consistency, prompt-caching economics, jailbreak taxonomy, …
2026-04-04 40 min
10
LLM Engineering (10): Evaluation
Why MMLU is broken, the contamination problem, LLM-as-judge biases, position-bias mitigation, calibration, and the A/B …
2026-04-05 36 min
11
LLM Engineering (11): Safety and Alignment
What alignment means engineering-wise, refusal calibration, the red-team taxonomy, hallucination metrics, sleeper …
2026-04-06 36 min
12
LLM Engineering (12): Production — Deployment, Monitoring, Cost
Serving stack choices in detail, autoscaling LLMs, latency budgets, prompt+completion cost tracking, multi-model …
2026-04-07 36 min