<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>CloudMonitor on Chen Kai Blog</title><link>https://www.chenk.top/en/tags/cloudmonitor/</link><description>Recent content in CloudMonitor on Chen Kai Blog</description><generator>Hugo</generator><language>en</language><lastBuildDate>Mon, 04 May 2026 09:00:00 +0000</lastBuildDate><atom:link href="https://www.chenk.top/en/tags/cloudmonitor/index.xml" rel="self" type="application/rss+xml"/><item><title>Alibaba Cloud Full Stack (7): SLS, CloudMonitor, and Observability</title><link>https://www.chenk.top/en/aliyun-fullstack/07-observability/</link><pubDate>Mon, 04 May 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/aliyun-fullstack/07-observability/</guid><description>&lt;p>The worst production outage I ever caused took three hours to diagnose. A Node.js service was returning 502s intermittently — maybe 5% of requests — and I had nothing. No centralized logs (each ECS instance had its own &lt;code>/var/log/&lt;/code> and I was SSH-ing into them one at a time). No metrics dashboards (I was running &lt;code>top&lt;/code> and &lt;code>df -h&lt;/code> in terminals). No tracing (I was adding &lt;code>console.log&lt;/code> timestamps to try to figure out which downstream call was hanging). Three hours later, I found the issue: a connection pool to RDS was exhausting under load because a forgotten cron job was holding connections open. The fix was two lines of code. 
The diagnosis took three hours of misery because I had zero observability.&lt;/p></description></item><item><title>Terraform for AI Agents (7): Observability, SLS Dashboards, and Cost Alarms</title><link>https://www.chenk.top/en/terraform-agents/07-observability-and-cost-control/</link><pubDate>Tue, 24 Mar 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/terraform-agents/07-observability-and-cost-control/</guid><description>&lt;p>Agents are non-deterministic and multi-step, and they call expensive APIs. This combination means you can&amp;rsquo;t debug them after the fact unless you instrumented them from the start. This article uses Terraform to set up three pipelines (logs, traces, and metrics) that feed a unified dashboard, adds six SLS queries for working through real incidents, and configures four alarms that have actually fired and saved my projects in production.&lt;/p>
&lt;p>By the end, you&amp;rsquo;ll have a DingTalk channel that alerts you before the bill explodes, latency climbs, the error rate spikes, or an agent starts looping on itself, plus SLO budgets that turn operational feelings into data.&lt;/p></description></item></channel></rss>