Series

LLM Engineering

End-to-end modern LLM stack: architectures, post-training, inference, RAG, evaluation, safety, and production.

Apr 7, 2026 LLM Engineering 36 min read

LLM Engineering (12): Production — Deployment, Monitoring, Cost

Serving-stack choices in detail, autoscaling LLM workloads, latency budgets, prompt and completion cost tracking, multi-model routing, FrugalGPT-style cascading, the observability you need from day one, and the on-call patterns that work.

Apr 6, 2026 LLM Engineering 36 min read

LLM Engineering (11): Safety and Alignment

What alignment means engineering-wise, refusal calibration, the red-team taxonomy, hallucination metrics, sleeper agents, refusal as a feature vector, constitutional AI, and what shipping safely actually requires in …

Apr 5, 2026 LLM Engineering 36 min read

LLM Engineering (10): Evaluation

Why MMLU is broken, the contamination problem, LLM-as-judge biases, position-bias mitigation, calibration, and the A/B testing patterns that actually catch regressions in production.

Apr 4, 2026 LLM Engineering 40 min read

LLM Engineering (9): Prompting at Production Scale

When chain-of-thought actually helps, self-consistency, prompt-caching economics, jailbreak taxonomy, prompt-injection defenses, and the prompts that survive in production.

Apr 3, 2026 LLM Engineering 34 min read

LLM Engineering (8): Retrieval-Augmented Generation

Chunking strategies, dense vs sparse vs hybrid retrieval, reranker selection, the long-context-vs-RAG tradeoff in 2026, and the failure modes that show up at 100K+ documents.

Apr 2, 2026 LLM Engineering 34 min read

LLM Engineering (7): Function Calling and Tool Use

JSON-mode vs function-mode vs free-form, parallel tool calls, structured-output guarantees with grammars, error recovery patterns, and the agent loops that survive contact with reality.

Apr 1, 2026 LLM Engineering 34 min read

LLM Engineering (6): Long Context — RoPE, YaRN, Sinks

How RoPE encodes position, why naive extension breaks, NTK-aware and YaRN scaling, ALiBi vs RoPE, attention sinks for streaming, and why 1M-context claims often fail at retrieval.

Mar 31, 2026 LLM Engineering 42 min read

LLM Engineering (5): Inference Optimization

KV cache mechanics, paged attention, continuous batching, speculative decoding, INT8/INT4/AWQ/GPTQ quantization, and the vLLM vs SGLang vs TensorRT-LLM tradeoffs.

Mar 30, 2026 LLM Engineering 52 min read

LLM Engineering (4): Post-training — SFT, DPO, RLHF, RLAIF

What SFT, DPO, RLHF, and RLAIF each actually optimize, when reward models fail, KL constraints, the LoRA-vs-full-FT debate, and the production post-training recipes that ship in 2026.

Mar 29, 2026 LLM Engineering 42 min read

LLM Engineering (3): Pretraining at Scale

Data mixing, deduplication, contamination, μP, FSDP vs ZeRO-3 vs pipeline parallelism, the practical 200B-token cliff, and the failure modes that only appear above 1,000 GPUs.

Mar 28, 2026 LLM Engineering 18 min read

LLM Engineering (2): Tokenization Deep Dive

BPE vs SentencePiece vs WordPiece, byte-level fallback, the CJK token-bloat problem, vocabulary expansion costs, and the chat-template tokens that silently shape every model's behavior.

Mar 27, 2026 LLM Engineering 56 min read

LLM Engineering (1): Architectures from Transformer to MoE

MHA → GQA → MQA, sparse MoE routing in Mixtral and Qwen3-MoE, sliding-window attention, and the state-space alternatives Mamba and RWKV — what each costs and where each wins.