Shipping AI Agents to the Cloud with Terraform, Part 7: Observability, SLS Dashboards, and Cost Alerts
Logs into SLS, traces into ARMS, metrics into CloudMonitor — all configured in HCL, so every new environment is born observable. Four alerts that have genuinely saved my projects, plus an SLS-driven cost dashboard that tells you which agent is burning the budget before payday.
Chen Kai
· 7 min read
Agents are non-deterministic, multi-step, and call expensive APIs. That combination means that if you don't instrument them from day one, you can't debug them after the fact. This installment uses Terraform to stand up three pipelines — logs, traces, metrics — feed them into one unified dashboard, and layer on four alerts that have actually saved my projects in production.
By the end you have a DingTalk group that pings you before the bill blows up, before latency falls over, before the error rate spikes, and before some agent loops on itself.
The three pipelines

Three signal types, three Alibaba Cloud services, all converging in SLS for humans to look at:
- Logs — agent stdout/stderr → Logtail on ECS → an SLS logstore
- Traces — the OpenTelemetry SDK in the agent code → ARMS APM (OTel-compatible)
- Metrics — host metrics from the CloudMonitor agent plus custom metrics from agent code → CloudMonitor → optionally forwarded to SLS
Don't settle for "logs only" or "metrics only". You want all three:
- Logs answer "what did the agent do?"
- Traces answer "where did the time go?"
- Metrics answer "is this happening more often than usual?"
Step 1: the SLS project and logstores
Everything observable starts from one SLS project. One project per environment is right; one per agent is too fragmented.
```hcl
resource "alicloud_log_project" "agents" {
  name        = "agents-${terraform.workspace}"
  description = "Logs and metrics for agents-${terraform.workspace}"
  tags = {
    Environment = terraform.workspace
    ManagedBy   = "terraform"
  }
}

locals {
  logstores = {
    "agent-runs"       = { ttl = 30, shard_count = 4 }
    "gateway-requests" = { ttl = 90, shard_count = 4 }
    "ecs-syslog"       = { ttl = 14, shard_count = 2 }
    "ack-cluster"      = { ttl = 30, shard_count = 4 }
    "audit"            = { ttl = 365, shard_count = 2 }
  }
}

resource "alicloud_log_store" "this" {
  for_each              = local.logstores
  project               = alicloud_log_project.agents.name
  name                  = each.key
  shard_count           = each.value.shard_count
  retention_period      = each.value.ttl
  auto_split            = true
  max_split_shard_count = 16
  encrypt_conf {
    enable       = true
    encrypt_type = "default"
    user_cmk_info {
      cmk_key_id = module.vpc.kms_keys["logs"]
      arn        = module.vpc.kms_keys["logs"]
      region_id  = "cn-shanghai"
    }
  }
}
```
Five logstores cover what you actually need:
- agent-runs — every step of every agent (the firehose)
- gateway-requests — one row per LLM API call, with model, tokens, latency, cost
- ecs-syslog — low-level OS logs from the ECS instances
- ack-cluster — Kubernetes events and pod logs (ACK only)
- audit — every Terraform change, kept for a year for compliance
audit keeps a year of history because it's tiny, and "who changed the prod ALB on March 12?" will come up years later.
Step 2: shipping logs from ECS
Logtail is Alibaba Cloud's official log collector. Install it in cloud-init (append to the cloud-init.sh from part four):
```shell
# Install Logtail
wget http://logtail-release-cn-shanghai.oss-cn-shanghai.aliyuncs.com/linux64/logtail.sh
chmod +x logtail.sh && ./logtail.sh install cn-shanghai
service ilogtaild start
# Tag this machine so it joins the SLS machine group (userdefined identity)
mkdir -p /etc/ilogtail/users
touch "/etc/ilogtail/users/${sls_user_id}"
echo "${sls_machine_group}" > /etc/ilogtail/user_defined_id
```
The Logtail config — which files to collect and how to parse them — is a Terraform resource:
```hcl
resource "alicloud_log_machine_group" "agent" {
  project       = alicloud_log_project.agents.name
  name          = "agent-runtime-machines"
  identify_type = "userdefined"
  identify_list = ["agent-runtime-${terraform.workspace}"]
  topic         = "agent-runs"
}

resource "alicloud_logtail_config" "agent" {
  project    = alicloud_log_project.agents.name
  logstore   = alicloud_log_store.this["agent-runs"].name
  input_type = "file"
  log_sample = <<-SAMPLE
    {"ts":"2026-03-24T09:15:23Z","agent":"research","step":"plan","tokens":420,"latency_ms":1200}
  SAMPLE
  name        = "agent-runs-collector"
  output_type = "LogService"
  input_detail = jsonencode({
    logType        = "json_log"
    logPath        = "/var/log/agents"
    filePattern    = "*.log"
    localStorage   = true
    enableRawLog   = false
    timeKey        = "ts"
    timeFormat     = "%Y-%m-%dT%H:%M:%S%z"
    discardUnmatch = false
    maxDepth       = 10
  })
}

resource "alicloud_logtail_attachment" "agent" {
  project             = alicloud_log_project.agents.name
  logtail_config_name = alicloud_logtail_config.agent.name
  machine_group_name  = alicloud_log_machine_group.agent.name
}
```
Now /var/log/agents/*.log on any tagged machine flows into SLS, queryable by field as JSON. The agent code just does logger.info(json.dumps({...})); everything else is automatic.
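On the agent side that can be a tiny helper around the standard logging module. A minimal sketch — the field names mirror the log_sample above; in production the handler would be a FileHandler under /var/log/agents/ where Logtail watches, a StreamHandler here for illustration:

```python
import json
import logging
from datetime import datetime, timezone

# One JSON object per line; Logtail parses each line as json_log.
logger = logging.getLogger("agent-runs")
logger.setLevel(logging.INFO)
# Production: logging.FileHandler("/var/log/agents/research.log")
logger.addHandler(logging.StreamHandler())

def log_step(agent: str, step: str, tokens: int, latency_ms: int) -> str:
    """Emit one agent step as a single-line JSON record and return it."""
    record = {
        "ts": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "agent": agent,
        "step": step,
        "tokens": tokens,
        "latency_ms": latency_ms,
    }
    line = json.dumps(record)
    logger.info(line)
    return line

log_step("research", "plan", 420, 1200)
```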
Step 3: OpenTelemetry → ARMS traces
Traces go through ARMS APM, which is OpenTelemetry-compatible. The Terraform side is small — provision the ARMS environment:
```hcl
resource "alicloud_arms_environment" "agents" {
  environment_name     = "agents-${terraform.workspace}"
  bind_resource_id     = module.vpc.vpc_id
  environment_type     = "CS" # cloud service
  environment_sub_type = "ECS"
  payment_type         = "POSTPAY"
}
```
The agent code uses standard OpenTelemetry — no Alibaba-specific SDK:
```python
import os

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(
        endpoint=os.environ["ARMS_OTLP_ENDPOINT"],
        headers={"Authentication": os.environ["ARMS_LICENSE_KEY"]},
    ))
)

tracer = trace.get_tracer("research-agent")
with tracer.start_as_current_span("research_loop") as span:
    span.set_attribute("agent.name", "research-agent")
    span.set_attribute("session.id", session_id)
    # ... agent work ...
```
The two environment variables come from ARMS — ARMS_OTLP_ENDPOINT from the ARMS console, ARMS_LICENSE_KEY from the account. Both get wired into the cloud-init template via Terraform outputs.
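One way to do that wiring — a sketch, with variable names of my own choosing; the values themselves are copied out of the ARMS console into tfvars or a secret store rather than read from a Terraform attribute:

```hcl
variable "arms_otlp_endpoint" { type = string }

variable "arms_license_key" {
  type      = string
  sensitive = true
}

resource "alicloud_instance" "agent" {
  # ... instance settings from part four ...
  user_data = templatefile("${path.module}/cloud-init.sh.tpl", {
    arms_otlp_endpoint = var.arms_otlp_endpoint
    arms_license_key   = var.arms_license_key
  })
}
```

The cloud-init template then exports both as environment variables for the agent process.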
The payoff: in ARMS you can see "this agent run took 12 seconds; 9 of them were the third call to qwen-max". That kind of visibility genuinely changes how you build agents.
Step 4: metrics with CloudMonitor
CloudMonitor picks up host-level metrics (CPU, memory, network) automatically once the cloud-monitor agent is installed — the install_cloud_monitor flag on the ACK node pool already handles that. On ECS, add it to cloud-init:
```shell
wget http://cms-agent-cn-shanghai.oss-cn-shanghai.aliyuncs.com/release/cms_go_agent_install.sh
chmod +x cms_go_agent_install.sh && ./cms_go_agent_install.sh
```
Application-level custom metrics — "tokens consumed by research-agent" — are written as SLS logs with structured fields, then alerted on with SLS queries. Logs-as-metrics is the pattern Alibaba Cloud pushes; CloudMonitor custom metrics work too, but they're awkward to configure from Terraform.
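As a sketch of that pattern (the helper and its field names are mine, not an Alibaba Cloud API): a counter emitted as one structured log line, which Logtail ships and an SLS SQL query later aggregates.

```python
import json
import sys
import time
from typing import Optional

def emit_metric(name: str, value: float, labels: Optional[dict] = None) -> str:
    """Write one metric sample as a JSON log line. Logtail ships it to SLS,
    where SQL (SUM, AVG, approx_percentile) turns the lines back into a metric."""
    record = {"ts": int(time.time()), "metric": name, "value": value, **(labels or {})}
    line = json.dumps(record)
    # Production: a file under /var/log/agents/; stdout for the sketch.
    sys.stdout.write(line + "\n")
    return line

emit_metric("tokens_consumed", 2232, {"agent": "research-agent", "model": "qwen-max"})
```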
Step 5: the cost dashboard
This is where it gets interesting. Every LLM request goes through the gateway, and the gateway writes one row per request to gateway-requests, with fields like:
```json
{
  "ts": "2026-03-24T09:15:23Z",
  "agent": "research-agent",
  "model": "qwen-max",
  "input_tokens": 1820,
  "output_tokens": 412,
  "latency_ms": 1230,
  "cost_cny": 0.087
}
```
SLS can run SQL. The "daily cost per agent" query:
```sql
* | SELECT date_trunc('day', __time__) AS day,
           agent,
           SUM(cost_cny) AS daily_cost
    FROM log
    GROUP BY day, agent
    ORDER BY day, daily_cost DESC
```
The dashboard itself is configured through Terraform:
```hcl
resource "alicloud_log_dashboard" "cost" {
  project_name   = alicloud_log_project.agents.name
  dashboard_name = "agent-cost-overview"
  display_name   = "Agent cost overview"
  char_list = jsonencode([
    {
      title    = "Daily cost per agent"
      type     = "line"
      query    = "* | SELECT date_trunc('day', __time__) AS day, agent, SUM(cost_cny) AS cost FROM log GROUP BY day, agent ORDER BY day"
      logstore = alicloud_log_store.this["gateway-requests"].name
      display  = { xAxis = ["day"], yAxis = ["cost"], yKey = "agent" }
    },
    {
      title    = "Tokens by model, last 24h"
      type     = "pie"
      query    = "* | SELECT model, SUM(input_tokens + output_tokens) AS tokens FROM log WHERE __time__ > now() - INTERVAL '24' HOUR GROUP BY model"
      logstore = alicloud_log_store.this["gateway-requests"].name
    },
    {
      title    = "p95 latency per agent"
      type     = "line"
      query    = "* | SELECT date_trunc('hour', __time__) AS hour, agent, approx_percentile(latency_ms, 0.95) AS p95 FROM log GROUP BY hour, agent ORDER BY hour"
      logstore = alicloud_log_store.this["gateway-requests"].name
    }
  ])
}
```
Open the SLS console and you have a live dashboard.

The dashboard is the answer to "which agent is burning my budget?" — a question you will be asked every month.
Step 6: the four alerts
Four alerts have earned their place across the agent stacks I've shipped:

Alert 1: cost ceiling
```hcl
resource "alicloud_log_alert" "cost_ceiling" {
  project_name      = alicloud_log_project.agents.name
  alert_name        = "daily-cost-ceiling"
  alert_displayname = "Daily LLM spend > ¥800"
  query_list {
    chart_title    = "today_cost"
    logstore       = alicloud_log_store.this["gateway-requests"].name
    query          = "* | SELECT SUM(cost_cny) AS today FROM log WHERE __time__ > to_unixtime(date_trunc('day', now()))"
    start          = "-1m"
    end            = "now"
    time_span_type = "Truncated"
  }
  condition         = "today > 800"
  schedule_interval = "5m"
  notify_threshold  = 1
  throttling        = "30m"
  notification_list {
    type        = "DingTalk"
    service_uri = var.dingtalk_webhook
    content     = "Today's LLM spend of ¥${"{{today}}"} is over the ¥800 budget. Check the SLS cost dashboard."
  }
  severity_configurations {
    severity       = 8
    eval_condition = { condition = "today > 800" }
  }
}
```
If the day's LLM spend crosses ¥800, this fires every 30 minutes. Tune the threshold to your actual budget. throttling matters — without it the alert fires every 5 minutes and the team mutes the group.
Alert 2: latency
```hcl
resource "alicloud_log_alert" "latency" {
  project_name      = alicloud_log_project.agents.name
  alert_name        = "agent-step-latency"
  alert_displayname = "p95 agent step > 8s"
  query_list {
    chart_title = "p95_step"
    logstore    = alicloud_log_store.this["agent-runs"].name
    query       = "* | SELECT approx_percentile(latency_ms, 0.95) / 1000.0 AS p95s FROM log WHERE __time__ > now() - INTERVAL '5' MINUTE"
    start       = "-5m"
    end         = "now"
  }
  condition         = "p95s > 8"
  schedule_interval = "1m"
  notify_threshold  = 3
  throttling        = "15m"
  notification_list {
    type        = "DingTalk"
    service_uri = var.dingtalk_webhook
    content     = "Agent p95 step latency is ${"{{p95s}}"}s — go look before users notice."
  }
}
```
notify_threshold = 3 means the condition has to hold for three consecutive minutes before it fires — which suppresses the noise from a single slow LLM call.
Alert 3: error rate
Same shape; the query is SUM(IF(status >= 500, 1, 0)) * 1.0 / COUNT(*) AS err_ratio with condition err_ratio > 0.02. Use a shorter throttling (5 minutes), because errors are usually genuine, sustained events.
Alert 4: token leak (runaway loops)
```hcl
resource "alicloud_log_alert" "token_spike" {
  project_name      = alicloud_log_project.agents.name
  alert_name        = "token-anomaly"
  alert_displayname = "Tokens/min > 2x the 24h rolling average"
  query_list {
    chart_title = "current"
    logstore    = alicloud_log_store.this["gateway-requests"].name
    query       = "* | SELECT SUM(input_tokens + output_tokens) AS now_tpm FROM log WHERE __time__ > now() - INTERVAL '1' MINUTE"
    start       = "-1m"
    end         = "now"
  }
  query_list {
    chart_title = "baseline"
    logstore    = alicloud_log_store.this["gateway-requests"].name
    query       = "* | SELECT AVG(per_min) AS baseline FROM (SELECT date_trunc('minute', __time__) AS m, SUM(input_tokens + output_tokens) AS per_min FROM log WHERE __time__ > now() - INTERVAL '24' HOUR GROUP BY m)"
    start       = "-24h"
    end         = "now"
  }
  condition         = "now_tpm > 2 * baseline"
  schedule_interval = "1m"
  notify_threshold  = 2
  throttling        = "10m"
  notification_list {
    type        = "DingTalk"
    service_uri = var.dingtalk_webhook
    content     = "Token burn is ${"{{now_tpm}}"} tpm vs a 24h average of ${"{{baseline}}"}. Possible runaway agent."
  }
}
```
This is the one that pays for the whole exercise. An agent with a buggy stop condition can burn ¥10,000 of tokens overnight; this alert catches it within 2 minutes, giving you time to kill the process.
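The alert's condition is easy to sanity-check offline — a minimal Python mirror of the same comparison over per-minute token counts:

```python
def is_token_spike(per_minute_history: list, current_tpm: int, factor: float = 2.0) -> bool:
    """Mirror of the SLS alert condition: current tokens/min vs
    factor x the rolling average of the trailing window."""
    if not per_minute_history:
        return False  # no baseline yet; stay quiet rather than page on startup
    baseline = sum(per_minute_history) / len(per_minute_history)
    return current_tpm > factor * baseline

# A day of roughly steady traffic around 1,000 tokens/min:
history = [1000] * (24 * 60)
```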
Why DingTalk?
Most engineering teams in China default to DingTalk, and SLS supports DingTalk webhooks natively. You can also fan out to email, SMS, or (via webhook) Slack, Teams, or Feishu. Pick whichever one your team will actually look at at 2 a.m.
What about alerts on the ARMS side?
ARMS has its own alerting — useful for trace-level conditions ("any trace with more than 30 spans"). The four SLS-side alerts above are enough, and they keep the alerting story from splitting across two systems. Reach for ARMS alerts only when SLS can't express what you need.
Costs
Observability has a real cost — typically 10-15% of the rest of the bill:
- SLS: ~¥0.35/GB ingested + ¥0.15/GB stored. A medium-traffic agent stack ingesting ~5 GB/day → ~¥50/month for ingest, ~¥20/month for 30-day retention
- ARMS APM: ~¥600/month for one environment, up to 100M spans
- CloudMonitor: standard metrics are free; custom metrics cost ¥0.005 per metric per day
Budget ¥1,000-1,500/month for full observability on a production agent stack. Far cheaper than one missed cost-runaway alert.
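The SLS line works out as follows — illustrative arithmetic at the unit prices above (check current pricing; the bullet rounds the ~¥75 result down to ¥50 + ¥20):

```python
# Unit prices from the bullet list above -- illustrative, not an official quote.
INGEST_CNY_PER_GB = 0.35
STORAGE_CNY_PER_GB_MONTH = 0.15

def sls_monthly_cost(gb_per_day: float, retention_days: int) -> float:
    """Monthly SLS cost: a month of ingest, plus steady-state storage,
    which holds roughly retention_days worth of data at any moment."""
    ingest = gb_per_day * 30 * INGEST_CNY_PER_GB
    storage = gb_per_day * retention_days * STORAGE_CNY_PER_GB_MONTH
    return ingest + storage

cost = sls_monthly_cost(5, 30)  # the ~5 GB/day, 30-day-retention case above
```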
Next up
Part eight is the end-to-end walkthrough. We compose every module from parts two through seven — vpc-baseline, compute, storage, gateway, observability — into a single research-agent-stack project and watch it come up from one terraform apply. Real apply output, real timings, the full module DAG. And at the end, a starter repo you can fork.
Liked this piece?
Follow on GitHub for the next one — usually one a week.
GitHub →