Terraform for AI Agents (7): Observability, SLS Dashboards, and Cost Alarms
Logs to SLS, traces to ARMS, metrics to CloudMonitor — all provisioned in HCL so a new env comes pre-instrumented. The four alarms that actually catch real incidents and the SLS-driven cost dashboard that tells you which agent is burning your budget before payday.
Agents are non-deterministic, multi-step, and call expensive APIs. The combination means you cannot debug them after the fact unless you instrumented them on day one. This article wires three pipelines through Terraform — logs, traces, metrics — into a unified dashboard, then layers four alarms that have actually fired and saved my projects in production.
By the end you have one DingTalk channel that pings when the bill is about to explode, latency degrades, the error rate spikes, or an agent starts looping on itself.
The three pipelines
Three signal types, three Aliyun services, all converging on SLS for human-friendly viewing:
Logs — agent stdout/stderr → Logtail agent on the ECS → SLS Logstore
Traces — OpenTelemetry SDK in the agent code → ARMS APM (which is OpenTelemetry-compatible)
Metrics — host metrics from CloudMonitor agent + custom metrics from agent code → CloudMonitor → optionally piped to SLS
Don’t pick “just logs” or “just metrics”. You need all three:
Logs answer “what did the agent do?”
Traces answer “where did the time go?”
Metrics answer “is this happening more often than usual?”
Step 1: SLS project and logstores
Everything observability-related starts with one SLS project. One per environment is right; one per agent is too granular. Inside it, five logstores:
agent-runs — every step of every agent run (the firehose)
gateway-requests — one row per LLM API call, with model, tokens, latency, cost
ecs-syslog — the underlying OS logs from the ECS instances
ack-cluster — Kubernetes events and pod logs (only if using ACK)
audit — every change Terraform makes, retained for a year for compliance
The audit store gets one-year retention because it's small and cheap, and you'll want it months later when "who changed the prod ALB on March 12" comes up.
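A sketch of the project and logstores as Terraform. The for_each map and the non-audit retention numbers are illustrative; later snippets in this article assume the resource addresses alicloud_log_project.agents and alicloud_log_store.this:

```hcl
locals {
  # logstore name => retention in days; audit keeps 365 for compliance
  logstores = {
    "agent-runs"       = 30
    "gateway-requests" = 90
    "ecs-syslog"       = 14
    "ack-cluster"      = 14
    "audit"            = 365
  }
}

resource "alicloud_log_project" "agents" {
  name        = "agents-${var.environment}"
  description = "Agent stack observability"
}

resource "alicloud_log_store" "this" {
  for_each = local.logstores

  project          = alicloud_log_project.agents.name
  name             = each.key
  retention_period = each.value
}
```

One project, one for_each block; adding a logstore later is a one-line map change.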
Step 2: ship logs from ECS
The Logtail agent is the official Aliyun-side log collector. Install it via cloud-init (add to cloud-init.sh from article 4):
```bash
# Install Logtail
wget http://logtail-release-cn-shanghai.oss-cn-shanghai.aliyuncs.com/linux64/logtail.sh
chmod +x logtail.sh && ./logtail.sh install cn-shanghai
service ilogtaild start

# Tag this machine for the SLS machine group (custom-identifier type)
echo "${sls_machine_group}" > /etc/ilogtail/user_defined_id
touch /etc/ilogtail/users/${sls_user_id}
```
The Logtail config — what files to tail, how to parse them — is a Terraform resource:
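A minimal sketch of that resource, plus the machine group and the attachment that ties them together. The input_detail keys follow Logtail's file-input JSON schema, and var.sls_machine_group is assumed to match the identifier cloud-init writes to /etc/ilogtail/user_defined_id:

```hcl
resource "alicloud_log_machine_group" "agents" {
  project       = alicloud_log_project.agents.name
  name          = var.sls_machine_group
  identify_type = "userdefined"
  identify_list = [var.sls_machine_group]
}

resource "alicloud_logtail_config" "agent_logs" {
  project     = alicloud_log_project.agents.name
  logstore    = alicloud_log_store.this["agent-runs"].name
  name        = "agent-json-logs"
  input_type  = "file"
  output_type = "LogService"

  # Tail JSON-formatted files under /var/log/agents/
  input_detail = jsonencode({
    logPath     = "/var/log/agents"
    filePattern = "*.log"
    logType     = "json_log"
    topicFormat = "default"
  })
}

resource "alicloud_logtail_attachment" "agent_logs" {
  project             = alicloud_log_project.agents.name
  logtail_config_name = alicloud_logtail_config.agent_logs.name
  machine_group_name  = alicloud_log_machine_group.agents.name
}
```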
Now any file matching /var/log/agents/*.log on any tagged machine flows into SLS as JSON, queryable by every field. The agent code just does logger.info(json.dumps({...})) and the rest is automatic.
Step 3: traces via OpenTelemetry → ARMS
For traces, ARMS APM is OpenTelemetry-compatible. The Terraform side is small — provision an ARMS instance and an environment:
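A hedged sketch, assuming the alicloud_arms_environment resource available in recent provider versions; argument names and accepted values vary by provider version, so check the provider docs before copying:

```hcl
# Sketch only: verify arguments against your alicloud provider version.
resource "alicloud_arms_environment" "agents" {
  environment_name     = "agents-${var.environment}"
  environment_type     = "ECS" # host-based environment
  environment_sub_type = "ECS"
  bind_resource_id     = var.vpc_id # assumed: the VPC the agents run in
}
```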
The agent code uses standard OpenTelemetry — nothing Aliyun-specific:
```python
import os

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(
            endpoint=os.environ["ARMS_OTLP_ENDPOINT"],
            headers={"Authentication": os.environ["ARMS_LICENSE_KEY"]},
        )
    )
)

tracer = trace.get_tracer("research-agent")

with tracer.start_as_current_span("research_loop") as span:
    span.set_attribute("agent.name", "research-agent")
    span.set_attribute("session.id", session_id)  # session_id from the surrounding agent code
    # ... agent work ...
```
The two env vars come from ARMS — ARMS_OTLP_ENDPOINT is in the ARMS console, ARMS_LICENSE_KEY from your account. Wire both via Terraform outputs into the cloud-init template.
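One way to do that wiring, assuming the compute module from article 4 renders cloud-init.sh with templatefile (the variable names here are illustrative):

```hcl
resource "alicloud_instance" "agent" {
  # ... instance arguments from article 4 ...

  user_data = templatefile("${path.module}/cloud-init.sh", {
    arms_otlp_endpoint = var.arms_otlp_endpoint # from the ARMS console
    arms_license_key   = var.arms_license_key
    sls_user_id        = var.sls_user_id
    sls_machine_group  = var.sls_machine_group
  })
}
```

cloud-init then exports the two ARMS values into the agent's environment file so the OpenTelemetry snippet above them up via os.environ.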
The reward: in ARMS you can see “this agent run took 12s; 9s of it was the third LLM call to qwen-max.” That’s the kind of visibility that actually changes how you build agents.
Step 4: metrics with CloudMonitor
CloudMonitor catches the host-level metrics automatically (CPU, memory, network) once you install the cloud-monitor agent — which the install_cloud_monitor flag on the ACK node pool already does. For ECS, add to cloud-init:
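A representative cloud-init fragment. The OSS download URL is region-specific and changes between agent versions, so treat the one below as a placeholder and copy the current install command from the CloudMonitor console (Host Monitoring > Install Agent):

```bash
# Install the CloudMonitor (CMS) agent.
# NOTE: illustrative URL; copy the current region-specific command
# from the CloudMonitor console before using this.
REGION=cn-shanghai
wget "http://cms-agent-${REGION}.oss-${REGION}-internal.aliyuncs.com/cms-go-agent/cms_go_agent_install.sh"
bash cms_go_agent_install.sh
```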
For custom application metrics — “tokens consumed by research-agent” — emit them as SLS log entries with structured fields, then alert via SLS query. SLS-as-metrics is the pattern Aliyun pushes; CloudMonitor custom metrics work too but are clunkier to wire from Terraform.
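A sketch of that pattern in Python, assuming the Logtail pipeline from step 2 is tailing /var/log/agents/ (the helper names and field layout are illustrative, not an Aliyun SDK API):

```python
import json
import logging


def make_metric_logger(path: str) -> logging.Logger:
    """Create a logger that writes one JSON object per line to `path`.

    Files under /var/log/agents/ are tailed by Logtail (step 2), so each
    line becomes a queryable SLS row with no extra plumbing.
    """
    logger = logging.getLogger(f"agent.metrics.{path}")
    handler = logging.FileHandler(path)
    handler.setFormatter(logging.Formatter("%(message)s"))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


def emit_metric(logger: logging.Logger, name: str, value: float, **labels) -> None:
    """Emit one metric observation as a structured log line."""
    logger.info(json.dumps({"metric": name, "value": value, **labels}))
```

Alerting on it is then just an SLS query, e.g. `* | SELECT SUM(value) FROM log WHERE metric = 'tokens_consumed'`.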
Step 5: the cost dashboard
Here’s where it gets interesting. Every LLM request hits the gateway, and the gateway logs one row per request to gateway-requests with fields like:
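A representative row (the field names match the dashboard and alarm queries used throughout this article; the values are illustrative):

```json
{
  "agent": "research-agent",
  "model": "qwen-max",
  "input_tokens": 1421,
  "output_tokens": 413,
  "cost_cny": 0.092,
  "latency_ms": 2310,
  "status": 200,
  "session_id": "c7e19b2f"
}
```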
```hcl
resource "alicloud_log_dashboard" "cost" {
  project_name   = alicloud_log_project.agents.name
  dashboard_name = "agent-cost-overview"
  display_name   = "Agent cost overview"

  char_list = jsonencode([
    {
      title    = "Daily cost by agent"
      type     = "line"
      query    = "* | SELECT date_trunc('day', __time__) AS day, agent, SUM(cost_cny) AS cost FROM log GROUP BY day, agent ORDER BY day"
      logstore = alicloud_log_store.this["gateway-requests"].name
      display  = { xAxis = ["day"], yAxis = ["cost"], yKey = "agent" }
    },
    {
      title    = "Tokens by model (last 24h)"
      type     = "pie"
      query    = "* | SELECT model, SUM(input_tokens + output_tokens) AS tokens FROM log WHERE __time__ > now() - INTERVAL '24' HOUR GROUP BY model"
      logstore = alicloud_log_store.this["gateway-requests"].name
    },
    {
      title    = "p95 latency by agent"
      type     = "line"
      query    = "* | SELECT date_trunc('hour', __time__) AS hour, agent, approx_percentile(latency_ms, 0.95) AS p95 FROM log GROUP BY hour, agent ORDER BY hour"
      logstore = alicloud_log_store.this["gateway-requests"].name
    }
  ])
}
```
Open the SLS console and you have a live dashboard:
The dashboard is the answer to “which agent is burning my budget?” — a question you will be asked monthly.
Step 6: the four alarms
Four alarms have earned their keep across multiple agent stacks I’ve shipped.
Alarm 1: daily cost ceiling
This fires at most once per 30 minutes if the day’s LLM spend so far is above ¥800. Tune the threshold to your real budget. The throttling matters — without it, the alert fires every 5 minutes and the team mutes the channel.
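A sketch of the cost alarm, following the same alicloud_log_alert shape as the token-anomaly alarm later in this step (the names and the ¥800 threshold are the only things you should need to change):

```hcl
resource "alicloud_log_alert" "daily_cost" {
  project_name      = alicloud_log_project.agents.name
  alert_name        = "daily-cost-ceiling"
  alert_displayname = "Today's LLM spend > ¥800"

  query_list {
    chart_title = "spend-today"
    logstore    = alicloud_log_store.this["gateway-requests"].name
    query       = "* | SELECT SUM(cost_cny) AS spend FROM log WHERE date_trunc('day', __time__) = date_trunc('day', now())"
    start       = "-24h"
    end         = "now"
  }

  condition         = "spend > 800"
  schedule_interval = "5m"
  throttling        = "30m" # at most one ping per half hour

  notification_list {
    type        = "DingTalk"
    service_uri = var.dingtalk_webhook
    content     = "LLM spend today ¥${"{{spend}}"} has crossed the ¥800 ceiling."
  }
}
```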
Alarm 2: p95 latency
notify_threshold = 3 means three consecutive minutes above the threshold before the alarm fires, which kills the noise from one-off slow LLM calls.
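A sketch of a p95-latency alarm in the same alicloud_log_alert shape (the 5-second threshold is illustrative; tune it to your agents):

```hcl
resource "alicloud_log_alert" "p95_latency" {
  project_name      = alicloud_log_project.agents.name
  alert_name        = "p95-latency"
  alert_displayname = "Gateway p95 latency > 5s"

  query_list {
    chart_title = "p95"
    logstore    = alicloud_log_store.this["gateway-requests"].name
    query       = "* | SELECT approx_percentile(latency_ms, 0.95) AS p95 FROM log"
    start       = "-1m"
    end         = "now"
  }

  condition         = "p95 > 5000"
  schedule_interval = "1m"
  notify_threshold  = 3 # three consecutive minutes above threshold
  throttling        = "15m"

  notification_list {
    type        = "DingTalk"
    service_uri = var.dingtalk_webhook
    content     = "Gateway p95 latency ${"{{p95}}"} ms for 3 consecutive minutes."
  }
}
```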
Alarm 3: error rate
Same shape, query is SUM(IF(status >= 500, 1, 0)) * 1.0 / COUNT(*) AS err_ratio, condition err_ratio > 0.02. Throttling is shorter (5 minutes) because errors are usually a real ongoing event.
Alarm 4: token anomaly
```hcl
resource "alicloud_log_alert" "token_spike" {
  project_name      = alicloud_log_project.agents.name
  alert_name        = "token-anomaly"
  alert_displayname = "Tokens/min > 2x rolling 24h avg"

  query_list {
    chart_title = "current"
    logstore    = alicloud_log_store.this["gateway-requests"].name
    query       = "* | SELECT SUM(input_tokens + output_tokens) AS now_tpm FROM log WHERE __time__ > now() - INTERVAL '1' MINUTE"
    start       = "-1m"
    end         = "now"
  }

  query_list {
    chart_title = "baseline"
    logstore    = alicloud_log_store.this["gateway-requests"].name
    query       = "* | SELECT AVG(per_min) AS baseline FROM (SELECT date_trunc('minute', __time__) AS m, SUM(input_tokens + output_tokens) AS per_min FROM log WHERE __time__ > now() - INTERVAL '24' HOUR GROUP BY m)"
    start       = "-24h"
    end         = "now"
  }

  condition         = "now_tpm > 2 * baseline"
  schedule_interval = "1m"
  notify_threshold  = 2
  throttling        = "10m"

  notification_list {
    type        = "DingTalk"
    service_uri = var.dingtalk_webhook
    content     = "Token consumption ${"{{now_tpm}}"} tpm vs 24h avg ${"{{baseline}}"}. Possible runaway agent."
  }
}
```
This is the one that has paid for itself. An agent with a buggy stop condition can burn ¥10,000 in tokens overnight; this alert catches it within 2 minutes and gives you time to kill the offender.
Why DingTalk?
In China, DingTalk is the team-chat default for most engineering orgs. SLS supports DingTalk webhooks natively. You can also fan out to email, SMS, and (via webhook) Slack/Teams/Lark. Pick whatever your team checks at 2am.
What about ARMS-side alerts?
ARMS has its own alerting — useful for trace-level conditions (“any trace with > 30 spans”). For the four alarms above, SLS-side is enough and avoids splitting your alerting story across two systems. Use ARMS alerts only when SLS can’t express what you need.
What it costs
Observability has a real cost — usually 10-15% of the rest of your bill:
SLS: ~¥0.35/GB ingested + ¥0.15/GB stored. A medium-traffic agent stack ingests ~5 GB/day → ¥50/mo for ingest, ¥20/mo for 30-day retention
ARMS APM: ~¥600/mo for 1 environment with up to 100M spans
CloudMonitor: free for standard metrics, ¥0.005 per custom metric per day
Budget ¥1000-1500/mo for full observability on a real production agent stack. Cheap compared to one missed cost-runaway alarm.
What’s next
Article 8 is the end-to-end walkthrough. We compose every module from articles 2-7 — vpc-baseline, compute, storage, gateway, observability — into one research-agent-stack project and watch it come up with a single terraform apply. Real apply output, real timing, the full module DAG. The starter repo at the end is yours to fork.
Liked this piece?
Follow on GitHub for the next one — usually one a week.