
LLM Engineering
End-to-end modern LLM stack: architectures, post-training, inference, RAG, evaluation, safety, and production.
01 LLM Engineering (1): Architectures from Transformer to MoE
MHA → GQA → MQA, sparse MoE routing in Mixtral and Qwen3-MoE, sliding-window attention, and the state-space alternatives …
02 LLM Engineering (2): Tokenization Deep Dive
BPE vs SentencePiece vs WordPiece, byte-level fallback, the CJK token-bloat problem, vocabulary expansion costs, and the …
03 LLM Engineering (3): Pretraining at Scale
Data mixing, deduplication, contamination, μP, FSDP vs ZeRO-3 vs pipeline parallel, the practical 200B-token cliff, and …
04 LLM Engineering (4): Post-training — SFT, DPO, RLHF, RLAIF
What SFT, DPO, RLHF, and RLAIF each actually optimize, when reward models fail, KL constraints, the LoRA-vs-full-FT …
05 LLM Engineering (5): Inference Optimization
KV cache mechanics, paged attention, continuous batching, speculative decoding, INT8/INT4/AWQ/GPTQ quantization, and the …
06 LLM Engineering (6): Long Context — RoPE, YaRN, Sinks
How RoPE encodes position, why naive extension breaks, NTK-aware and YaRN scaling, ALiBi vs RoPE, attention sinks for …
07 LLM Engineering (7): Function Calling and Tool Use
JSON-mode vs function-mode vs free-form, parallel tool calls, structured-output guarantees with grammars, error recovery …
08 LLM Engineering (8): Retrieval-Augmented Generation
Chunking strategies, dense vs sparse vs hybrid retrieval, reranker selection, the long-context-vs-RAG tradeoff in 2026, …
09 LLM Engineering (9): Prompting at Production Scale
Chain-of-thought when it actually helps, self-consistency, prompt-caching economics, jailbreak taxonomy, …
10 LLM Engineering (10): Evaluation
Why MMLU is broken, the contamination problem, LLM-as-judge biases, position-bias mitigation, calibration, and the A/B …
11 LLM Engineering (11): Safety and Alignment
What alignment means engineering-wise, refusal calibration, the red-team taxonomy, hallucination metrics, sleeper …
12 LLM Engineering (12): Production — Deployment, Monitoring, Cost
Serving stack choices in detail, autoscaling LLMs, latency budgets, prompt+completion cost tracking, multi-model …