Monitoring on Chen Kai Blog

LLM Engineering (12): Production — Deployment, Monitoring, Cost

Tue, 07 Apr 2026 09:00:00 +0000

This is the last chapter. The previous ones covered building the model, the prompt, the retrieval, and the evaluation. This one is about keeping the whole system running in production without breaking the bank. Production LLM serving is more like running a high-traffic web service than classical ML serving, except that each request costs real money and can take up to two minutes.

I’ll focus more on numbers here than in earlier chapters. In production, the difference between a profitable feature and a money pit often comes down to a 2-5x cost factor that no one is tracking. The most useful skill to develop is back-of-the-envelope cost arithmetic for LLM workloads. The numbers below are accurate as of late 2025 / early 2026; verify them against current pricing before committing to a budget.
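As a rough illustration of that kind of arithmetic, here is a minimal sketch. The prices, token counts, and traffic figures are made-up placeholders, not any provider's real rates, and the names `Pricing`, `cost_per_request`, and `monthly_cost` are hypothetical helpers introduced only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    # Hypothetical prices in USD per one million tokens.
    # Replace with the current rates of whatever model you actually use.
    input_per_mtok: float
    output_per_mtok: float


def cost_per_request(p: Pricing, input_tokens: int, output_tokens: int) -> float:
    """USD cost of one request, billing input and output tokens separately."""
    total = input_tokens * p.input_per_mtok + output_tokens * p.output_per_mtok
    return total / 1_000_000


def monthly_cost(p: Pricing, requests_per_day: int,
                 input_tokens: int, output_tokens: int, days: int = 30) -> float:
    """USD cost of a steady workload over `days` days."""
    return cost_per_request(p, input_tokens, output_tokens) * requests_per_day * days


# Assumed workload: 10,000 requests/day, 2,000 input and 500 output tokens each.
pricing = Pricing(input_per_mtok=1.0, output_per_mtok=4.0)
per_request = cost_per_request(pricing, input_tokens=2_000, output_tokens=500)
monthly = monthly_cost(pricing, requests_per_day=10_000,
                       input_tokens=2_000, output_tokens=500)

print(f"per request: ${per_request:.4f}")
print(f"per month:   ${monthly:,.2f}")
```

With these placeholder inputs a request costs well under a cent, yet the monthly bill lands in the four-figure range. Doubling the prompt length or the output length moves that total directly, which is the kind of untracked multiplier worth checking before launch.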

Databases (8): Databases in Practice — Migration, Monitoring, and War Stories

Tue, 30 Apr 2024 09:00:00 +0000

Knowing how databases work internally is half the battle. The other half is keeping them running in production without losing data, dropping availability, or waking up at 3 AM. This article covers the operational knowledge that comes from experience — the things nobody teaches you until something breaks.

Schema Migrations: Changing the Engine While Flying

Your schema will change. New features require new columns, new tables, new indexes. The question is how to evolve the schema without downtime.

Cloud Computing (7): Cloud Operations and DevOps Practices

Fri, 26 May 2023 09:00:00 +0000

In 2017 GitLab lost six hours of database state. An engineer, exhausted, ran rm -rf on the wrong server during an incident. The backup procedures had silently been broken for months; nobody noticed because no one was restoring from backups. The lesson is not “be careful with rm”. The lesson is that operations is a system — tools, runbooks, monitoring, automation, and the rituals around them. When the system is healthy, no single tired engineer can take down production. When the system is rotten, every late-night fix is one keystroke from disaster.