Terraform for AI Agents on Chen Kai Blog

Terraform for AI Agents (8): End-to-End — research-agent-stack in One Apply

Thu, 26 Mar 2026 09:00:00 +0000

This is the article where everything from articles 2 through 7 lands in one place. By the end you’ll have run terraform apply once and produced a complete, observable, budgeted agent runtime stack on Alibaba Cloud. That comes to about 31 resources and roughly 7 minutes of wall-clock time.

The stack we’re building:

Five layers — edge, compute, memory, platform, ops — composed from the modules we built across this series.

Terraform for AI Agents (7): Observability, SLS Dashboards, and Cost Alarms

Tue, 24 Mar 2026 09:00:00 +0000

Agents are non-deterministic, multi-step, and call expensive APIs. The combination means you cannot debug them after the fact unless you instrumented them on day one. This article wires three pipelines through Terraform — logs, traces, metrics — into a unified dashboard, then layers four alarms that have actually fired and saved my projects in production.

By the end you have one DingTalk channel that pings before the bill explodes, latency degrades, the error rate spikes, or an agent starts looping on itself.

Terraform for AI Agents (6): LLM Gateway and Secrets Management

Sun, 22 Mar 2026 09:00:00 +0000

A pattern I see repeatedly in immature agent stacks: each agent has its own copy of OPENAI_API_KEY in its own .env file. Sometimes it is the same key, sometimes different ones, sometimes a colleague’s personal key left over from prototyping. When the bill arrives, nobody can tell which agent caused which token spend, and when a key leaks (it always does) you’re playing whack-a-mole across a dozen .env files.

This article ends that. We build one LLM gateway that:

Terraform for AI Agents (5): Storage — Vector, Relational, and Object Memory

Fri, 20 Mar 2026 09:00:00 +0000

An agent’s memory is the part most tutorials hand-wave. “Just put the embeddings in Pinecone, the sessions in Postgres, the screenshots in S3.” On Aliyun, all three exist as managed services, and Terraform-provisioning them right is the difference between “memory works” and “we lost three weeks of conversation history because the disk filled up at 4am”.

This article covers all three layers, the Terraform for each, and the boring-but-critical lifecycle and backup rules.

Terraform for AI Agents (4): Compute — ECS, ACK, or Function Compute?

Wed, 18 Mar 2026 09:00:00 +0000

The single most important architecture decision in an agent system is where the agent loop process actually runs. There are exactly three good answers on Aliyun. Picking the wrong one isn’t catastrophic — you can migrate later — but it costs you weeks of unnecessary scaffolding.

This article walks through all three with working Terraform, the cost crossover, and the operational gotchas.

The three patterns

Terraform for AI Agents (3): A Reusable VPC and Security Baseline

Mon, 16 Mar 2026 09:00:00 +0000

This article builds the single most copied piece of Terraform in my agent projects: a vpc-baseline module that gives every later component (ECS, RDS, OpenSearch, ACK) a sane place to land.

By the end you’ll have:

A VPC across three availability zones in one region
Six subnets (one public + one private per zone) with non-overlapping CIDRs
A NAT gateway with EIP for private-subnet outbound to LLM APIs
Three security groups stacked by tier (ALB → agent runtime → memory)
Three KMS customer master keys, one per data domain (memory, secrets, logs)
A clean module interface: name + CIDR + zones in, IDs out

It’s about 200 lines of HCL all-in. Worth typing once and referring to forever.
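
To make the "six subnets with non-overlapping CIDRs" item concrete, here is a minimal sketch of how such a module could derive its subnet ranges from the three inputs listed above. It is not the article's actual module: the variable names, the default zone IDs, and the offset of 100 for private subnets are all assumptions chosen for illustration. It uses only Terraform's built-in cidrsubnet function, so it plans without any provider configured.

```hcl
# Hypothetical inputs mirroring "name + CIDR + zones in".
variable "name" {
  type    = string
  default = "research-agent"
}

variable "cidr" {
  type    = string
  default = "10.0.0.0/16"
}

variable "zones" {
  type = list(string)
  # Placeholder zone IDs; substitute zones available in your region.
  default = ["cn-hangzhou-h", "cn-hangzhou-i", "cn-hangzhou-j"]
}

locals {
  # One /24 per zone for public subnets: 10.0.0.0/24, 10.0.1.0/24, 10.0.2.0/24.
  public_cidrs = {
    for i, z in var.zones : z => cidrsubnet(var.cidr, 8, i)
  }

  # One /24 per zone for private subnets, offset by 100 (an arbitrary
  # choice) so they can never collide with the public ranges above.
  private_cidrs = {
    for i, z in var.zones : z => cidrsubnet(var.cidr, 8, i + 100)
  }
}

# Outputs keyed by zone, so a caller can look up a range per zone.
output "public_cidrs" {
  value = local.public_cidrs
}

output "private_cidrs" {
  value = local.private_cidrs
}
```

Keying the maps by zone rather than by list index means that reordering the zones list only changes which range a zone gets, not the shape of the output, which tends to keep downstream references readable.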

Terraform for AI Agents (2): Provider, Auth, and Remote State on OSS

Sat, 14 Mar 2026 09:00:00 +0000

This is the article where you stop reading and start typing. By the end you will have:

The alicloud Terraform provider installed and version-pinned
Authentication wired up — through the right method, not the convenient one
Remote state on an OSS bucket with Tablestore-based locking
Three workspaces (dev, staging, prod) that share a backend but isolate state
A working terraform plan against an empty config

Nothing here provisions an agent yet. We’re laying the foundation that every later article assumes.
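
For orientation, the items above could come together in a configuration along these lines. This is a sketch under stated assumptions rather than the article's own code: the bucket name, prefix, region, Tablestore endpoint and table name, and both version constraints are placeholders. Credentials are deliberately absent from the file and are expected to come from the environment or an assumed role.

```hcl
terraform {
  # Placeholder constraint; pin to whatever version you have tested.
  required_version = ">= 1.5.0"

  required_providers {
    alicloud = {
      source  = "aliyun/alicloud"
      version = "~> 1.220" # placeholder pin
    }
  }

  # Remote state on OSS, with Tablestore used for state locking.
  backend "oss" {
    bucket              = "example-agent-tfstate" # hypothetical bucket
    prefix              = "agents"                # hypothetical path prefix
    key                 = "terraform.tfstate"
    region              = "cn-hangzhou"           # placeholder region
    encrypt             = true
    tablestore_endpoint = "https://example-lock.cn-hangzhou.ots.aliyuncs.com" # hypothetical instance
    tablestore_table    = "terraform_lock"        # hypothetical table
  }
}

provider "alicloud" {
  region = "cn-hangzhou" # placeholder region; no keys in source control
}
```

With a backend like this in place, running terraform workspace new dev (and likewise for staging and prod) creates workspaces that share the bucket and lock table while keeping each state file separate.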

Terraform for AI Agents (1): Why IaC Is the Only Sane Way to Ship Agents

Thu, 12 Mar 2026 09:00:00 +0000

I have shipped four agent systems on Alibaba Cloud in the last eighteen months. Three of them started life as a tmux session on a single ECS instance someone created by clicking through the console. All three of those needed a panicked weekend of rebuilding when the second engineer joined the project, when the prod region had a stockout, or when the security team asked for a network diagram.

The fourth started life as terraform apply. It is the only one I haven’t lost a weekend to.