LLM Engineering (5): Inference Optimization

Tue, 31 Mar 2026 09:00:00 +0000

Inference is where the money goes. A single 70B-class model serving 1000 concurrent users at 50 tok/s burns through a GPU budget equal to the one used to train it in about 3 months of serving. This chapter focuses on two latency metrics, time-to-first-token (TTFT) and inter-token latency (ITL), and one cost ratio, GPU-seconds per million output tokens.
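To make the three quantities concrete, here is a minimal sketch of how they could be computed from per-token timestamps. The `RequestTrace` structure and the function names are hypothetical and chosen for illustration; they are not taken from any inference engine, and real benchmarks usually report percentiles rather than a single mean.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    """Timing for one request (hypothetical structure, for illustration only)."""
    sent_at: float            # seconds, when the request was submitted
    token_times: List[float]  # seconds, arrival time of each output token


def ttft(trace: RequestTrace) -> float:
    """Time-to-first-token: delay from submission to the first output token."""
    return trace.token_times[0] - trace.sent_at


def mean_itl(trace: RequestTrace) -> float:
    """Mean inter-token latency: average gap between consecutive output tokens."""
    if len(trace.token_times) < 2:
        raise ValueError("need at least two output tokens to measure ITL")
    gaps = [b - a for a, b in zip(trace.token_times, trace.token_times[1:])]
    return sum(gaps) / len(gaps)


def gpu_seconds_per_million_tokens(
    num_gpus: int, wall_seconds: float, output_tokens: int
) -> float:
    """GPU-seconds consumed per one million output tokens."""
    return num_gpus * wall_seconds * 1_000_000 / output_tokens


# Made-up numbers: first token after 0.5 s, then one token every 0.25 s.
example = RequestTrace(sent_at=0.0, token_times=[0.5, 0.75, 1.0, 1.25])
print(ttft(example))      # 0.5
print(mean_itl(example))  # 0.25
# 8 GPUs running for 100 s while emitting 400,000 output tokens.
print(gpu_seconds_per_million_tokens(8, 100.0, 400_000))  # 2000.0
```

TTFT and ITL are measured per request, while the cost ratio is measured across the whole deployment, which is why the last function takes aggregate counts rather than a trace.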

Training is a one-time capital expense, with costs spread over millions of inference calls. Inference, however, is a recurring operating expense that doesn’t amortize. A 50% (1.5x) improvement in tokens-per-GPU-second cuts GPU-seconds per million output tokens by a third, and that saving accrues every day of the product’s lifetime. That’s why every serious LLM team has at least one full-time engineer focused on inference, and why the open-source community has released four distinct waves of inference engines (FasterTransformer → DeepSpeed-Inference → vLLM → SGLang/TensorRT-LLM/llama.cpp) in five years.
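The arithmetic behind the 1.5x claim can be sketched in a few lines. The load figure reads the introduction's "1000 concurrent users at 50 tok/s" as 50 tok/s per user, which is an interpretation. The baseline of 2,500 tok/s per GPU is a made-up number used only to show the calculation, not a measured figure for any model or engine.

```python
import math


def tokens_per_day(aggregate_tok_per_s: int) -> int:
    """Output tokens generated in 24 hours at a constant aggregate rate."""
    return aggregate_tok_per_s * 86_400


def gpus_needed(aggregate_tok_per_s: float, tok_per_gpu_second: float) -> int:
    """Whole GPUs required to sustain the aggregate token rate."""
    return math.ceil(aggregate_tok_per_s / tok_per_gpu_second)


def gpu_second_fraction(speedup: float) -> float:
    """Fraction of the original GPU-seconds per token left after a speedup."""
    return 1.0 / speedup


# Load from the text, read as 1000 users x 50 tok/s each (interpretation).
aggregate = 1000 * 50

# ASSUMPTION: hypothetical per-GPU throughput, for illustration only.
baseline_tok_per_gpu_s = 2_500.0

print(tokens_per_day(aggregate))                              # 4320000000
print(gpus_needed(aggregate, baseline_tok_per_gpu_s))         # 20
print(gpus_needed(aggregate, baseline_tok_per_gpu_s * 1.5))   # 14
print(round(gpu_second_fraction(1.5), 3))                     # 0.667
```

Under these assumptions the fleet shrinks from 20 GPUs to 14. The saving is slightly less than a third because GPUs come in whole units, and the reduced bill repeats for as long as the service runs.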

Paged-Attention on Chen Kai Blog
