Alibaba Cloud Full Stack (7): SLS, CloudMonitor, and Observability

Mon, 04 May 2026 09:00:00 +0000

The worst production outage I ever caused took three hours to diagnose. A Node.js service was returning 502s intermittently — maybe 5% of requests — and I had nothing. No centralized logs (each ECS instance had its own /var/log/ and I was SSH-ing into them one at a time). No metrics dashboards (I was running top and df -h in terminals). No tracing (I was adding console.log timestamps to try to figure out which downstream call was hanging). Three hours later, I found the issue: a connection pool to RDS was exhausting under load because a forgotten cron job was holding connections open. The fix was two lines of code. The diagnosis took three hours of misery because I had zero observability.

Terraform for AI Agents (7): Observability, SLS Dashboards, and Cost Alarms

Tue, 24 Mar 2026 09:00:00 +0000

Agents are non-deterministic, multi-step, and call expensive APIs. This combination means you can’t debug them after the fact unless you instrumented them from the start. This article builds three pipelines through Terraform (logs, traces, and metrics) that feed a unified dashboard, adds six SLS queries for diagnosing real incidents, and sets up four alarms that have actually fired and saved my projects in production.

By the end, you’ll have a DingTalk channel that alerts you before the bill explodes, latency climbs, the error rate spikes, or an agent starts looping on itself, plus SLO budgets that turn operational gut feelings into data.

CloudMonitor on Chen Kai Blog
