LLM Engineering on Chen Kai Blog

LLM Engineering (12): Production — Deployment, Monitoring, Cost

Tue, 07 Apr 2026 09:00:00 +0000

This is the last chapter. The previous ones covered building the model, the prompt, the retrieval, and the evaluation. This chapter is about keeping all of that running without breaking the bank. Production LLM serving is more like running a high-traffic web service than classical ML serving, except each web request costs money and can take up to two minutes.

I’ll focus more on numbers here than in earlier chapters. In production, the difference between a profitable feature and a money pit often boils down to a 2-5x cost factor that no one is tracking. The most useful skill to develop is back-of-the-envelope cost arithmetic for LLM workloads. The numbers below are accurate as of late 2025 / early 2026; verify them against current pricing before committing.
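As a rough illustration of the back-of-the-envelope arithmetic mentioned above, here is a minimal Python sketch. The prices, token counts, and traffic figures are placeholders chosen for easy mental checking, not quotes from any provider; the `Pricing` class and both helper functions are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """USD per million tokens. Placeholder values, not real provider prices."""
    input_per_mtok: float
    output_per_mtok: float


def cost_per_request(pricing: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Cost of one request: tokens times per-million price, summed over both directions."""
    total = input_tokens * pricing.input_per_mtok + output_tokens * pricing.output_per_mtok
    return total / 1_000_000


def monthly_cost(pricing: Pricing, input_tokens: int, output_tokens: int,
                 requests_per_day: int, days: int = 30) -> float:
    """Scale the per-request cost to a month of steady traffic."""
    return cost_per_request(pricing, input_tokens, output_tokens) * requests_per_day * days


# Assumed workload: 2,000 prompt tokens and 500 completion tokens per request,
# 100,000 requests per day, at placeholder prices of $3 / $15 per million tokens.
pricing = Pricing(input_per_mtok=3.0, output_per_mtok=15.0)
per_request = cost_per_request(pricing, input_tokens=2_000, output_tokens=500)
per_month = monthly_cost(pricing, input_tokens=2_000, output_tokens=500,
                         requests_per_day=100_000)

print(f"per request: ${per_request:.4f}")
print(f"per month:   ${per_month:,.0f}")
```

With these placeholder inputs, output tokens account for more than half of the bill even though they are a fifth of the token volume, which is the kind of asymmetry this arithmetic is meant to surface.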

LLM Engineering (11): Safety and Alignment

Mon, 06 Apr 2026 09:00:00 +0000

Safety has the worst signal-to-noise ratio of any topic in LLM engineering. There’s a lot of philosophy, a lot of marketing, and not a lot of engineering specifics. This chapter is the engineering specifics: what RLHF actually optimizes when it talks about “safety,” how refusal calibration breaks, what red-teaming looks like in practice, the hallucination measures that actually predict customer impact, and the small but significant 2024-2026 papers (Sleeper Agents, refusal as a feature direction, weak-to-strong generalization) that should change how you think about alignment in production.

LLM Engineering (10): Evaluation

Sun, 05 Apr 2026 09:00:00 +0000

Evaluation is the part of the LLM stack where everyone has opinions but no one is confident. The leaderboards are gamed, the public benchmarks are contaminated, and most teams I’ve worked with had no eval set when I joined. This chapter covers what evaluation actually tells you, what the benchmarks hide, the LLM-as-judge biases that go unaddressed, the calibration metrics most teams skip, and the production patterns that catch regressions before customers notice.

LLM Engineering (9): Prompting at Production Scale

Sat, 04 Apr 2026 09:00:00 +0000

A prompt that works on 100 examples in a notebook can fail on 10% of inputs in production for reasons unrelated to cleverness. This chapter covers prompting as an engineering task: where chain-of-thought helps (and where it doesn’t), how prompt caching affects costs, how to combine few-shot, chain-of-thought, and self-consistency without using every trick, and how to defend against jailbreaks and injections that production traffic will generate within a week of launch.

LLM Engineering (8): Retrieval-Augmented Generation

Fri, 03 Apr 2026 09:00:00 +0000

RAG is the most over-deployed and under-engineered pattern in LLM applications. The 2024 demo loop — embed everything with text-embedding-3-large, dump into pgvector, top-5 cosine — works for 1000 documents and a forgiving demo. It does not survive 100K real documents and a customer who notices when the answer is wrong. This chapter is what I wish more teams knew before they built their second generation of RAG.
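For reference, the "top-5 cosine" step of that demo loop can be sketched in a few lines of plain Python. This is a brute-force illustration over an in-memory dict, assuming the embeddings have already been computed; the function names and the tiny two-dimensional vectors are made up for the example and are not pgvector's or any embedding provider's API.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: list[float], docs: dict[str, list[float]], k: int = 5) -> list[tuple[str, float]]:
    """Score every document against the query and keep the k most similar."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in docs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-dimensional "embeddings" (hypothetical; real embedders emit hundreds
# or thousands of dimensions).
docs = {
    "a": [1.0, 0.0],
    "b": [0.0, 1.0],
    "c": [1.0, 1.0],
}
results = top_k([1.0, 0.1], docs, k=2)
for doc_id, score in results:
    print(doc_id, round(score, 3))
```

The scan is linear in the number of documents and has no reranking or filtering step, which is part of why the pattern holds up for a small demo corpus and not much further.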

The original RAG paper (Lewis et al., 2020) framed retrieval-augmented generation as a hybrid model: a dense retriever (DPR) trained jointly with a generator (BART) so the retrieval objective optimized end-task accuracy. Production RAG in 2026 doesn’t look much like Lewis’s RAG: modern systems use frozen pre-trained embedders, separate rerankers, and decoder-only generators that don’t train against the retriever. But the core insight (parameterize knowledge separately from reasoning) survived and became the dominant paradigm. The Gao et al. (2023) RAG survey is the best comprehensive overview of the post-2020 evolution into “Naive RAG → Advanced RAG → Modular RAG.”

LLM Engineering (7): Function Calling and Tool Use

Thu, 02 Apr 2026 09:00:00 +0000

Function calling connects an LLM to the world outside its weights. It combines chat-template details (Chapter 2), structured-output kernels (Chapter 5), and prompt engineering (Chapter 9). This chapter explores what happens under the hood, the guarantees you can rely on, and the agent-loop patterns that handle real workloads.

The intellectual lineage matters. Tool use as an LLM capability traces back to two near-simultaneous papers in 2022: MRKL Systems (Karpas et al., AI21), which proposed expert-routing among neuro-symbolic modules, and ReAct (Yao et al., 2022), which interleaved chain-of-thought reasoning with tool actions. Toolformer (Schick et al., 2023) showed self-supervised teaching of tool use, generating training data by having a model insert tool-call markers into existing text. By 2024 every frontier model had post-training data structured around the tool-use format, and tool calling moved from “research demo” to “API feature.”

LLM Engineering (6): Long Context — RoPE, YaRN, Sinks

Wed, 01 Apr 2026 09:00:00 +0000

“1M token context” is one of the most over-claimed numbers in LLMs. A model can attend to 1M tokens — that’s an architecture statement. A model can use information at position 800K to answer a question — that’s a behavior statement, and it’s more challenging. This chapter covers the math of position encoding, the engineering tricks that extend context beyond the training length, and why most long-context claims fail needle-in-a-haystack tests.

LLM Engineering (5): Inference Optimization

Tue, 31 Mar 2026 09:00:00 +0000

Inference is where the money goes. A single 70B-class model serving 1000 concurrent users at 50 tok/s consumes the GPU budget used to train the model in about 3 months. This chapter focuses on two key metrics: time-to-first-token (TTFT) and inter-token latency (ITL), and one ratio: GPU-seconds per million output tokens.

Training is a one-time capital expense, with costs spread over millions of inference calls. Inference, however, is a recurring operating expense that doesn’t amortize. A 50% (1.5x) improvement in tokens-per-GPU-second compounds daily over the product’s lifetime. That’s why every serious LLM team has at least one full-time engineer focused on inference, and why the open-source community has released four distinct waves of inference engines (FasterTransformer → DeepSpeed-Inference → vLLM → SGLang/TensorRT-LLM/llama.cpp) in five years.
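To make the metrics above concrete, here is a small sketch of the arithmetic. It uses the 1000 users at 50 tok/s figure from this summary; the 8-GPU deployment size, the TTFT and ITL values, and both function names are assumptions for illustration. The latency formula (TTFT plus one ITL per subsequent token) is a common first-order approximation, not a measurement.

```python
def gpu_seconds_per_million_output_tokens(num_gpus: int,
                                          aggregate_tokens_per_second: float) -> float:
    """GPU-seconds spent to emit one million output tokens across the deployment."""
    if aggregate_tokens_per_second <= 0:
        raise ValueError("aggregate_tokens_per_second must be positive")
    return num_gpus * 1_000_000 / aggregate_tokens_per_second


def request_latency_seconds(ttft: float, itl: float, output_tokens: int) -> float:
    """First-order latency model: wait for the first token, then one ITL per extra token."""
    if output_tokens < 1:
        raise ValueError("output_tokens must be at least 1")
    return ttft + (output_tokens - 1) * itl


# 1000 concurrent users at 50 tok/s each gives the aggregate decode rate.
aggregate = 1000 * 50  # 50,000 tokens per second

# Hypothetical deployment: 8 GPUs sustain that aggregate rate.
baseline = gpu_seconds_per_million_output_tokens(8, aggregate)

# The same hardware after a 1.5x throughput improvement.
improved = gpu_seconds_per_million_output_tokens(8, aggregate * 1.5)

# Assumed latency profile: 0.5 s to first token, 20 ms per token afterwards.
latency = request_latency_seconds(ttft=0.5, itl=0.02, output_tokens=501)

print(f"baseline: {baseline:.1f} GPU-s per 1M output tokens")
print(f"improved: {improved:.1f} GPU-s per 1M output tokens")
print(f"latency for 501 output tokens: {latency:.1f} s")
```

Because the ratio is GPU count divided by throughput, a 1.5x throughput gain cuts GPU-seconds per million tokens by a third at the same hardware footprint, and that saving recurs on every token served.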

LLM Engineering (4): Post-training — SFT, DPO, RLHF, RLAIF

Mon, 30 Mar 2026 09:00:00 +0000

A base model from pretraining can complete text but cannot follow instructions, refuse harmful requests, or maintain a persona—these are post-training behaviors. Post-training is where the gap between a research paper’s claims and a production-grade model lies. This chapter covers what each post-training algorithm optimizes, why most reward models are subtly flawed, and the effective methods for 2026.

LLM Engineering (3): Pretraining at Scale

Sun, 29 Mar 2026 09:00:00 +0000

Pretraining is where most of an LLM’s capability comes from, and it’s also where the leaderboard-vs-reality gap is widest. Most published runs are heroic engineering more than they are scientific results. This chapter is about the parts of pretraining that you actually have to get right when you’re not OpenAI: the data, the parallelism choice, and the failure modes that only show up when the cluster is large enough to make a single bad NCCL all-reduce kill a 30-day run.

LLM Engineering (2): Tokenization Deep Dive

Sat, 28 Mar 2026 09:00:00 +0000

Tokenization is the layer everyone skips. It’s also the layer where I’ve debugged the most production bugs — silent quality regressions, mysterious cost spikes, models refusing to follow instructions because someone formatted the chat template wrong. This chapter is everything I wish I’d internalized before shipping a multilingual product.

What a tokenizer actually does

A tokenizer maps a string to a list of integer IDs; decoding maps the IDs back to a string. Both directions are deterministic, but the mapping is not bijective in general: round-tripping tokenizer.decode(tokenizer.encode(s)) can lose whitespace, normalize Unicode, or collapse repeated punctuation, depending on the algorithm.
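The lossy round trip is easy to reproduce with a toy example. The `ToyTokenizer` below is a hypothetical whitespace tokenizer written only for this illustration; it is not the API of any real tokenizer library, and real subword tokenizers differ in which details they drop. It applies NFKC normalization and splits on whitespace, which is enough to show both kinds of loss.

```python
import unicodedata


class ToyTokenizer:
    """Hypothetical whitespace tokenizer for illustration; not a real library API."""

    def __init__(self) -> None:
        self.token_to_id: dict[str, int] = {}
        self.id_to_token: list[str] = []

    def encode(self, text: str) -> list[int]:
        # NFKC folds compatibility characters, e.g. the ligature U+FB01 becomes "fi".
        normalized = unicodedata.normalize("NFKC", text)
        ids = []
        # str.split() with no arguments discards all runs of whitespace.
        for token in normalized.split():
            if token not in self.token_to_id:
                self.token_to_id[token] = len(self.id_to_token)
                self.id_to_token.append(token)
            ids.append(self.token_to_id[token])
        return ids

    def decode(self, ids: list[int]) -> str:
        # The original spacing is gone, so decoding can only guess a single space.
        return " ".join(self.id_to_token[i] for i in ids)


tokenizer = ToyTokenizer()
original = "\ufb01ne   tuning\n"  # ligature, triple space, trailing newline
ids = tokenizer.encode(original)
round_tripped = tokenizer.decode(ids)

print(repr(original))
print(repr(round_tripped))
print("lossless:", round_tripped == original)
```

Encoding the same string twice yields the same IDs, so the process is deterministic, yet the decoded text differs from the input in both its Unicode form and its whitespace.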

LLM Engineering (1): Architectures from Transformer to MoE

Fri, 27 Mar 2026 09:00:00 +0000

The 2017 Transformer block is still the silhouette of every production LLM in 2026, but almost every internal piece has been swapped, sparsified, or specialized. This series covers the modern stack end to end — architecture, training, inference, retrieval, evaluation, safety, deployment. Chapter 1 is about the block itself: what attention looks like in a 2026 model, how MoE breaks the param-FLOPs link, and where the non-attention alternatives (Mamba, RWKV) actually beat the Transformer.