Terraform-Agents on Chen Kai Blog

Terraform for AI Agents (8): End-to-End — research-agent-stack in One Apply

Thu, 26 Mar 2026 09:00:00 +0000

This is where everything from articles 2 through 7 lands in one place. By the end you’ll have run terraform apply once and produced a complete, observable, budgeted agent runtime stack on Alibaba Cloud — about 31 resources, ~7 minutes of wall clock, ¥12,530/month all-in at prod sizing.

The stack we’re building:

Five layers — edge, compute, memory, platform, ops — composed from the modules we built across this series. Eleven Aliyun products under the hood: VPC, ECS, ALB, OSS, RDS for PostgreSQL, OpenSearch, KMS, SLS, ARMS, CloudMonitor, and DashScope (the LLM provider, accessed via the gateway).

Terraform for AI Agents (7): Observability, SLS Dashboards, and Cost Alarms

Tue, 24 Mar 2026 09:00:00 +0000

Agents are non-deterministic, multi-step, and call expensive APIs. This combination means you can’t debug them after the fact unless you instrumented them from the start. This article sets up three pipelines through Terraform — logs, traces, and metrics — into a unified dashboard, adds six SLS queries to solve real incidents, and sets up four alarms that have actually fired and saved my projects in production.

By the end, you’ll have a DingTalk channel that alerts you before the bill explodes, latency increases, the error rate spikes, or an agent starts looping on itself — plus SLO budgets that turn operational feelings into data.

Terraform for AI Agents (6): LLM Gateway and Secrets Management

Sun, 22 Mar 2026 09:00:00 +0000

A pattern I see repeatedly in immature agent stacks: each agent has its own copy of OPENAI_API_KEY in its own .env file. Sometimes the same key, sometimes different ones, sometimes a colleague’s personal key from when they prototyped. When the bill arrives nobody can tell which agent caused which token spend, and when a key leaks (it always does) you’re playing whack-a-mole across a dozen .env files.

The real wake-up call hit me two years ago. A contractor finished his three-month engagement on a Friday, his laptop went home, and on the following Tuesday DashScope billing flagged 12 million tokens of qwen-max traffic from an IP we didn’t recognise. His personal API key — copy-pasted into a side project — was still sitting in our agent’s .env. Rotating it took six hours: three engineers, four repos, two CI pipelines, one panicked Slack thread. Never again.

Terraform for AI Agents (5): Storage — Vector, Relational, and Object Memory

Fri, 20 Mar 2026 09:00:00 +0000

Most tutorials gloss over an agent’s memory. ‘Just put the embeddings in Pinecone, the sessions in Postgres, and the screenshots in S3.’ On Aliyun, all three are managed services. Correctly provisioning them with Terraform can mean the difference between a working memory and losing three weeks of conversation history because the disk filled up at 4 AM.

This article covers all three layers, their Terraform configurations, the critical but tedious backup and disaster recovery (DR) setup, the major version upgrade process, and the Saturday outage that changed how I do things.

Terraform for AI Agents (4): Compute — ECS, ACK, or Function Compute?

Wed, 18 Mar 2026 09:00:00 +0000

The single most important architectural decision in an agent system is where the agent loop process runs. There are three good options on Aliyun, plus a fourth that almost everyone forgets. Picking the wrong one isn’t catastrophic — you can migrate later — but it costs weeks of unnecessary work and several thousand RMB a month in idle compute.

This article covers all four options with working Terraform, cost crossovers, and the operational gotchas I keep running into.

Terraform for AI Agents (3): A Reusable VPC and Security Baseline

Mon, 16 Mar 2026 09:00:00 +0000

This article builds the single most copied piece of Terraform in my agent projects: a vpc-baseline module that gives every later component (ECS, RDS, OpenSearch, ACK) a sane place to land. It’s about 200 lines of HCL all-in. Worth typing once and referring to forever.

By the end you’ll have:

A VPC across three availability zones in one region
Six vSwitches (one public + one private per zone) with non-overlapping CIDRs
A NAT Gateway with EIP for private-subnet outbound to LLM APIs
Three security groups stacked by tier (ALB → agent runtime → memory)
Three KMS customer master keys, one per data domain (memory, secrets, logs)
A clean module interface: name + CIDR + zones in, IDs out
Drift detection in CI, semver-pinned module references, and a per-line cost model
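
As a rough illustration of the "name + CIDR + zones in, IDs out" interface, here is a minimal sketch covering only the VPC and the six vSwitches. It is not the article's actual module: the variable names, the 10.0.0.0/16 default, and the /20 subnet sizing are assumptions chosen for the example, and the NAT Gateway, security groups, and KMS keys are left out.

```hcl
# Sketch only: variable names and CIDR sizing are assumptions, not the series' module.
variable "name" {
  type = string
}

variable "vpc_cidr" {
  type    = string
  default = "10.0.0.0/16" # assumed default
}

variable "zones" {
  type        = list(string)
  description = "Three zone IDs in one region"
}

locals {
  # cidrsubnet() carves non-overlapping /20 blocks out of a /16:
  # indexes 0..2 are public, 3..5 are private (one of each per zone).
  public = {
    for i, z in var.zones :
    "public-${z}" => { zone = z, cidr = cidrsubnet(var.vpc_cidr, 4, i) }
  }
  private = {
    for i, z in var.zones :
    "private-${z}" => { zone = z, cidr = cidrsubnet(var.vpc_cidr, 4, i + length(var.zones)) }
  }
  subnets = merge(local.public, local.private)
}

resource "alicloud_vpc" "this" {
  vpc_name   = var.name
  cidr_block = var.vpc_cidr
}

resource "alicloud_vswitch" "this" {
  for_each     = local.subnets
  vpc_id       = alicloud_vpc.this.id
  zone_id      = each.value.zone
  cidr_block   = each.value.cidr
  vswitch_name = "${var.name}-${each.key}"
}

output "vpc_id" {
  value = alicloud_vpc.this.id
}

output "vswitch_ids" {
  # Map of "public-<zone>" / "private-<zone>" to vSwitch ID.
  value = { for k, v in alicloud_vswitch.this : k => v.id }
}
```

Deriving every subnet from one CIDR with cidrsubnet keeps the blocks non-overlapping by construction, so callers only ever pass the three inputs.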

The mental model

Before code, the picture:

Terraform for AI Agents (2): Provider, Auth, and Remote State on OSS

Sat, 14 Mar 2026 09:00:00 +0000

This article is where you stop reading and start typing. By the end, you’ll have:

The alicloud Terraform provider installed and version-pinned.
Authentication wired up — through the right method, not the convenient one.
Remote state on an OSS bucket with Tablestore-based locking.
Three workspaces (dev, staging, prod) that share a backend but isolate state.
A working terraform plan against an empty config.
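
For orientation, a pinned provider plus an OSS backend with Tablestore locking might look roughly like the sketch below. The bucket name, prefix, region, Tablestore endpoint, table name, and version constraint are all placeholders invented for this example, not values from the article.

```hcl
# Sketch only: every name, region, endpoint, and version below is a placeholder.
terraform {
  required_providers {
    alicloud = {
      source  = "aliyun/alicloud"
      version = "~> 1.220" # assumed pin; use the version you have tested
    }
  }

  backend "oss" {
    bucket  = "example-agent-tfstate" # hypothetical bucket
    prefix  = "research-agent-stack"  # hypothetical state prefix
    key     = "terraform.tfstate"
    region  = "cn-hangzhou"           # assumed region
    encrypt = true

    # State locking through a Tablestore (OTS) instance and table.
    tablestore_endpoint = "https://example-tflock.cn-hangzhou.ots.aliyuncs.com"
    tablestore_table    = "terraform_lock"
  }
}

provider "alicloud" {
  region = "cn-hangzhou" # assumed region
}
```

With a backend like this in place, terraform workspace new dev (then staging and prod) creates workspaces that share the bucket and lock table while each keeps its own state object.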

Nothing here provisions an agent yet. This lays the foundation for all future articles. If you skip this and try to wing it in article 3, you’ll likely face a state-corruption incident within a week.

Terraform for AI Agents (1): Why IaC Is the Only Sane Way to Ship Agents

Thu, 12 Mar 2026 09:00:00 +0000

I have shipped four agent systems on Alibaba Cloud in the last eighteen months. Three of them started life as a tmux session on a single ECS instance someone created by clicking through the console. All three of those needed a panicked weekend of rebuilding when the second engineer joined the project, when the prod region had a stockout, or when the security team asked for a network diagram.

The fourth started life as terraform apply. It is the only one I haven’t lost a weekend to.