LLM Engineering
End-to-end modern LLM stack: architectures, post-training, inference, RAG, evaluation, safety, and production.
LLM Engineering (12): Production — Deployment, Monitoring, Cost
Serving stack choices in detail, autoscaling LLMs, latency budgets, prompt+completion cost tracking, multi-model routing, FrugalGPT cascading, observability you need from day one, and the on-call patterns that work.
LLM Engineering (11): Safety and Alignment
What alignment means engineering-wise, refusal calibration, the red-team taxonomy, hallucination metrics, sleeper agents, refusal as a feature vector, constitutional AI, and what shipping safely actually requires in …
LLM Engineering (10): Evaluation
Why MMLU is broken, the contamination problem, LLM-as-judge biases, position-bias mitigation, calibration, and the A/B testing patterns that actually catch regressions in production.
LLM Engineering (9): Prompting at Production Scale
Chain-of-thought when it actually helps, self-consistency, prompt-caching economics, jailbreak taxonomy, prompt-injection defenses, and the prompts that survive in production.
LLM Engineering (8): Retrieval-Augmented Generation
Chunking strategies, dense vs sparse vs hybrid retrieval, reranker selection, the long-context-vs-RAG tradeoff in 2026, and the failure modes that show up at 100K+ documents.
LLM Engineering (7): Function Calling and Tool Use
JSON-mode vs function-mode vs free-form, parallel tool calls, structured-output guarantees with grammars, error recovery patterns, and the agent loops that survive contact with reality.
LLM Engineering (6): Long Context — RoPE, YaRN, Sinks
How RoPE encodes position, why naive extension breaks, NTK-aware and YaRN scaling, ALiBi vs RoPE, attention sinks for streaming, and why 1M-context claims often fail at retrieval.
LLM Engineering (5): Inference Optimization
KV cache mechanics, paged attention, continuous batching, speculative decoding, INT8/INT4/AWQ/GPTQ quantization, and the vLLM vs SGLang vs TensorRT-LLM tradeoffs.
LLM Engineering (4): Post-training — SFT, DPO, RLHF, RLAIF
What SFT, DPO, RLHF, and RLAIF each actually optimize, when reward models fail, KL constraints, the LoRA-vs-full-FT debate, and the production post-training recipes that ship in 2026.
LLM Engineering (3): Pretraining at Scale
Data mixing, deduplication, contamination, μP, FSDP vs ZeRO-3 vs pipeline parallel, the practical 200B-token cliff, and the failure modes that only appear above 1000 GPUs.
LLM Engineering (2): Tokenization Deep Dive
BPE vs SentencePiece vs WordPiece, byte-level fallback, the CJK token-bloat problem, vocabulary expansion costs, and the chat-template tokens that silently shape every model's behavior.
LLM Engineering (1): Architectures from Transformer to MoE
MHA → GQA → MQA, sparse MoE routing in Mixtral and Qwen3-MoE, sliding-window attention, and the state-space alternatives Mamba and RWKV — what each costs and where each wins.











