<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Paged-Attention on Chen Kai Blog</title><link>https://www.chenk.top/en/tags/paged-attention/</link><description>Recent content in Paged-Attention on Chen Kai Blog</description><generator>Hugo</generator><language>en</language><lastBuildDate>Tue, 31 Mar 2026 09:00:00 +0000</lastBuildDate><atom:link href="https://www.chenk.top/en/tags/paged-attention/index.xml" rel="self" type="application/rss+xml"/><item><title>LLM Engineering (5): Inference Optimization</title><link>https://www.chenk.top/en/llm-engineering/05-inference/</link><pubDate>Tue, 31 Mar 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/llm-engineering/05-inference/</guid><description>&lt;p>Inference is where the money goes. A single 70B-class model serving 1000 concurrent users at 50 tok/s burns through the GPU budget that was used to train it in about 3 months. This chapter focuses on two latency metrics, time-to-first-token (TTFT) and inter-token latency (ITL), and one cost ratio: GPU-seconds per million output tokens.&lt;/p>
&lt;p>Training is a one-time capital expense whose cost is spread over millions of inference calls. Inference, by contrast, is a recurring operating expense that never amortizes. A 1.5x improvement in tokens per GPU-second pays off every day of the product&amp;rsquo;s lifetime. That&amp;rsquo;s why every serious LLM team has at least one full-time engineer focused on inference, and why the open-source community has released four distinct waves of inference engines (FasterTransformer → DeepSpeed-Inference → vLLM → SGLang/TensorRT-LLM/llama.cpp) in five years.&lt;/p></description></item></channel></rss>