<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Autoscaling on Chen Kai Blog</title><link>https://www.chenk.top/en/tags/autoscaling/</link><description>Recent content in Autoscaling on Chen Kai Blog</description><generator>Hugo</generator><language>en</language><lastBuildDate>Tue, 07 Apr 2026 09:00:00 +0000</lastBuildDate><atom:link href="https://www.chenk.top/en/tags/autoscaling/index.xml" rel="self" type="application/rss+xml"/><item><title>LLM Engineering (12): Production — Deployment, Monitoring, Cost</title><link>https://www.chenk.top/en/llm-engineering/12-production/</link><pubDate>Tue, 07 Apr 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/llm-engineering/12-production/</guid><description>&lt;p>This is the last chapter. The previous ones covered building the model, the prompt, the retrieval, and the evaluation. This chapter focuses on keeping the resulting system running without breaking the bank. Production LLM serving is more like running a high-traffic web service than classical ML serving, except each web request costs money and can take up to two minutes.&lt;/p>
&lt;p>I&amp;rsquo;ll focus more on numbers here than in earlier chapters. In production, the difference between a profitable feature and a money pit often boils down to a 2-5x cost factor that no one is tracking. The most useful skill to develop is back-of-the-envelope cost arithmetic for LLM workloads. The numbers below are accurate as of late 2025 / early 2026; verify them against current pricing before committing.&lt;/p></description></item></channel></rss>