<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Monitoring on Chen Kai Blog</title><link>https://www.chenk.top/en/tags/monitoring/</link><description>Recent content in Monitoring on Chen Kai Blog</description><generator>Hugo</generator><language>en</language><lastBuildDate>Tue, 07 Apr 2026 09:00:00 +0000</lastBuildDate><atom:link href="https://www.chenk.top/en/tags/monitoring/index.xml" rel="self" type="application/rss+xml"/><item><title>LLM Engineering (12): Production — Deployment, Monitoring, Cost</title><link>https://www.chenk.top/en/llm-engineering/12-production/</link><pubDate>Tue, 07 Apr 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/llm-engineering/12-production/</guid><description>&lt;p>This is the last chapter. The previous ones covered building the model, the prompt, the retrieval, and the evaluation. This chapter focuses on running the resulting system in production without breaking the bank. Production LLM serving is more like running a high-traffic web service than classical ML serving, except that each request costs money and can take up to two minutes.&lt;/p>
&lt;p>I&amp;rsquo;ll focus more on numbers here than in earlier chapters. In production, the difference between a profitable feature and a money pit often boils down to a 2-5x cost factor that no one is tracking. The most useful skill to develop is back-of-the-envelope cost arithmetic for LLM workloads. The numbers below are accurate as of late 2025 / early 2026; verify them against current pricing before committing.&lt;/p></description></item><item><title>Databases (8): Databases in Practice — Migration, Monitoring, and War Stories</title><link>https://www.chenk.top/en/databases/08-database-in-practice/</link><pubDate>Tue, 30 Apr 2024 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/databases/08-database-in-practice/</guid><description>&lt;p>Knowing how databases work internally is half the battle. The other half is keeping them running in production without losing data, dropping availability, or waking up at 3 AM. This article covers the operational knowledge that comes from experience — the things nobody teaches you until something breaks.&lt;/p>
&lt;hr>
&lt;h2 id="schema-migrations-changing-the-engine-while-flying" class="heading-anchor">Schema Migrations: Changing the Engine While Flying&lt;a href="#schema-migrations-changing-the-engine-while-flying" class="heading-link" aria-label="Permalink to this section" title="Copy link to this section">#&lt;/a>
&lt;/h2>&lt;p>Your schema will change. New features require new columns, new tables, new indexes. The question is how to evolve the schema without downtime.&lt;/p></description></item><item><title>Cloud Computing (7): Cloud Operations and DevOps Practices</title><link>https://www.chenk.top/en/cloud-computing/operations-devops/</link><pubDate>Fri, 26 May 2023 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/cloud-computing/operations-devops/</guid><description>&lt;p>&lt;figure class="article-figure">
 &lt;img src="https://blog-pic-ck.oss-cn-beijing.aliyuncs.com/posts/en/cloud-computing/operations-devops/illustration_1.png" alt="Cloud Computing (7): Cloud Operations and DevOps Practices — Chapter overview" loading="lazy" decoding="async" class="content-image">
 
&lt;/figure>
&lt;/p>
&lt;p>In 2017, GitLab lost roughly six hours of production database data. An exhausted engineer ran &lt;code>rm -rf&lt;/code> on the wrong server during an incident. The backup procedures had been silently broken for months; nobody noticed because no one was restoring from backups. The lesson is not &amp;ldquo;be careful with rm&amp;rdquo;. The lesson is that operations is a &lt;em>system&lt;/em>: tools, runbooks, monitoring, automation, and the rituals around them. When the system is healthy, no single tired engineer can take down production. When the system is rotten, every late-night fix is one keystroke from disaster.&lt;/p></description></item></channel></rss>