Series · Terraform Agents · Chapter 1

Terraform for AI Agents (1): Why IaC Is the Only Sane Way to Ship Agents

Agent systems are a moving target — new tools, new memory stores, new regions every month. Manual console clicks don't survive the second teammate. This first article makes the case for Terraform on Alibaba Cloud, surveys what the alicloud provider actually covers, and compares it to Pulumi, Crossplane, and ROS so you pick the right tool the first time.

I have shipped four agent systems on Alibaba Cloud in the last eighteen months. Three of them started life as a tmux session on a single ECS instance someone created by clicking through the console. All three of those needed a panicked weekend of rebuilding when the second engineer joined the project, when the prod region had a stockout, or when the security team asked for a network diagram.

The fourth started life as terraform apply. It was the only one I haven’t lost a weekend to.

This series is the field guide for that fourth pattern: how to use Terraform to provision the cloud infrastructure that an AI agent system actually needs on Alibaba Cloud. It is not a Terraform tutorial — there are good ones online and the official Get Started doc covers the basics. It is the senior-engineer playbook for the specific intersection of “I run agents” and “I run them on Aliyun”.

Eight articles. One real, working stack at the end. This first one explains the why.

Terraform for AI Agents (1): Why IaC Is the Only Sane Way to Ship Agents — Chapter overview


What “an agent system” actually requires#

Before we talk infrastructure, let’s name the components an agent system has — the ones a pip install langgraph README usually skips:

AI agent workloads running on cloud infrastructure

  1. A runtime that holds the agent loop process — usually Python or Node — and survives restarts
  2. A vector store for semantic memory — embeddings of documents, prior conversations, tool outputs
  3. A relational store for session state — turn-by-turn conversation, tool-call traces, user identity
  4. An object store for artifacts — generated images, PDFs, screenshots, run snapshots
  5. An LLM gateway — one place that holds the API keys and enforces per-agent quotas
  6. Outbound network — to call DashScope, OpenAI, Anthropic, your scraping targets
  7. Observability — agent runs are non-deterministic, so logs and traces are not optional
  8. Secrets — provider keys, OAuth tokens, OSS credentials, database passwords
  9. Cost control — because token bills can 10x overnight when an agent loops on itself

That is at least nine separate Aliyun services touching each other in specific ways. Each has its own console page, its own RAM permissions, its own region scoping, its own networking. The probability that you can wire all of this up by hand and have it still match across dev, staging, and prod after three months of evolution is roughly zero.

The console-vs-IaC moment#

Managing nine services manually creates nine drift surfaces. This issue is so common that I have a standard figure for it:

Infrastructure as Code workflow transforming declarative configs into cloud resources

Console clicks vs Terraform — where the divergence happens

Read the left column carefully. Every step is plausible — none of them are dumb mistakes. They are what happens when smart people make small reasonable decisions over months. The right column is the same path, but every step leaves an artifact in git. The diff between the two columns is the difference between “I shipped this” and “I am paged at 2am because nobody knows what’s running in cn-beijing.”

The official Alibaba Cloud Terraform doc puts it more diplomatically:

Console operations: Click and enter parameters step by step. Repeat manual steps — hard to ensure consistency. Rely on documentation and verbal agreements.

Terraform: Describe the desired state of resources in configuration files. Configuration files are reviewable, shareable, and reusable. Store configuration files in version control. Changes are traceable and reversible.

That second paragraph is the entire pitch. Everything else in this series is implementation detail.

What Terraform actually is, in two sentences#

Terraform is an open-source declarative tool from HashiCorp. You write .tf files in HashiCorp Configuration Language (HCL) that describe the cloud resources you want; Terraform diffs that desired state against the live state recorded in a state file and emits a plan; you review the plan; you apply it; Terraform translates the plan into provider API calls.

Three key points to remember:

  • Declarative, not imperative. You don’t say “create an instance” — you say “an instance of this shape exists.” Re-running the same config is a no-op if nothing changed. This is what makes Terraform safe to run from CI on every commit.
  • State is real. The terraform.tfstate file is a JSON map from your HCL resource addresses to the cloud’s actual resource IDs. Lose the state file and Terraform thinks nothing exists. Article 2 is about putting state somewhere durable — but the implications run deeper than “don’t lose the file,” and we’ll come back to that below.
  • Plan before apply. This is the killer feature. Every change shows you a literal diff of what will create, modify, or destroy before anything happens. Cultivate the habit of pasting the plan output into PR descriptions — your future self will thank you.

State as the agent stack’s bill of materials#

The “state is real” point deserves more than one bullet, because for an agent stack the state file is doing double duty as your inventory.

Every agent stack I’ve shipped has been audited at some point—by a security review, by finance reconciling cloud spend, or by a new SRE trying to figure out what’s running. The question is always the same: what exists, who created it, and what is it costing me?

If your infrastructure is in a Terraform state file, that question takes 30 seconds to answer:

1
2
3
terraform state list | wc -l                                  # how many resources
terraform state list | awk -F. '{print $1"."$2}' | sort -u    # what kinds
terraform show -json | jq '[.values.root_module.resources[] | {addr:.address, type:.type}]'

For the four agent stacks I run today, those three commands produce a comprehensive inventory in seconds. Before Terraform, the same audit required opening twelve console tabs across ECS, VPC, RDS, OSS, RAM, KMS, SLS, ARMS, ACK, CloudMonitor, ALB, and OpenSearch—filtered by tags if I was lucky, and by gut feeling if I wasn’t.

The state file is also a bill of materials in the supply-chain sense. Each resource carries its provider version and module source. When a CVE drops on the alicloud provider — and it does, a couple of times a year — you grep state files across all your projects in minutes:

1
2
3
4
5
for d in stack-*/; do
  (cd "$d" && terraform providers schema -json 2>/dev/null \
     | jq -r '.provider_schemas | keys[]' \
     | grep -F 'aliyun/alicloud' && echo "  in $d")
done

The deeper point: state turns infrastructure into data. Once it’s data, you can write tooling against it. I have a small Python script that walks every state file across every project and produces a single CSV of (stack, resource_type, resource_id, region, monthly_cost_estimate). That CSV gets read by my monthly cost meeting. None of it would exist without the state file as a uniform structured source of truth.

The flip side is that the state file is precious. Lose it and the universe of resources becomes invisible to Terraform — you’ll get wholesale “resource already exists” errors on the next plan. Article 2 is entirely about putting the state somewhere that won’t disappear.

What the Aliyun provider covers#

State is only useful if there’s a provider that can actually create the things you’ve declared. Cloud platforms talk to Terraform through provider plug-ins. The official alicloud provider was the first official Terraform provider in China and is maintained by Alibaba. As of this writing it ships 300+ resource types across roughly six domains:

alicloud provider coverage

Per the official documentation, supported categories include:

  • Compute and containers: ECS, ACK (Kubernetes), Function Compute, Auto Scaling
  • Networking: VPC, SLB, ALB, NLB, NAT Gateway, Cloud Enterprise Network
  • Storage and databases: OSS, NAS, ApsaraDB RDS, PolarDB, Redis, MongoDB
  • Security and management: RAM, KMS, WAF
  • Observability: SLS, ARMS, CloudMonitor
  • Big data and AI: MaxCompute, PAI

That covers everything in our nine-component checklist, including the per-LLM-provider keys (handled through KMS Secrets Manager in article 6).

A minimal HCL example, with a pinned provider version because you should always pin from day one:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
terraform {
  required_version = ">= 1.6.0"
  required_providers {
    alicloud = {
      source  = "aliyun/alicloud"
      version = "~> 1.230"
    }
  }
}

provider "alicloud" {
  region = "cn-shanghai"
}

resource "alicloud_vpc" "main" {
  vpc_name   = "agents-prod"
  cidr_block = "10.20.0.0/16"
}

resource "alicloud_vswitch" "private_a" {
  vpc_id     = alicloud_vpc.main.id
  cidr_block = "10.20.1.0/24"
  zone_id    = "cn-shanghai-l"
}

resource "alicloud_security_group" "agent_runtime" {
  name   = "agent-runtime-sg"
  vpc_id = alicloud_vpc.main.id
}

Three resources, with vpc_id references that Terraform resolves into the right dependency order automatically. You don’t say “first VPC, then vSwitch, then SG” — you write what you want and Terraform builds the DAG.

Modules: the unit of reuse#

Three resources is the toy version. Real stacks have hundreds, and the way you keep hundreds tractable is modules. A module is just a directory of .tf files that takes inputs and produces outputs. Once you have a working pattern — a VPC with three vSwitches, a NAT, and a security group baseline — wrap it in a module and you can stamp it out across dev, staging, prod, and intl-prod without copying HCL.

A bare-bones module call:

1
2
3
4
5
6
7
8
9
module "vpc" {
  source = "./modules/vpc-baseline"

  for_each = toset(["dev", "staging", "prod"])

  vpc_name   = "agents-${each.key}"
  cidr_block = "10.20.0.0/16"
  zones      = ["cn-shanghai-l", "cn-shanghai-m", "cn-shanghai-n"]
}

The body of ./modules/vpc-baseline/main.tf contains the actual alicloud_vpc, alicloud_vswitch, alicloud_nat_gateway resources. The caller doesn’t need to know — they just want a VPC with sane defaults. Same idea as a Python function, applied to infrastructure. (Use for_each over count whenever the iteration is over a meaningful set of names — count re-numbers everything when you delete the middle item, which causes Terraform to destroy and recreate unrelated resources.)

We will build exactly this module in article 3 and reuse it in every subsequent article.

The agent-specific failure modes IaC actually prevents#

Before comparing Terraform against alternatives, it’s worth pinning down what specifically you’re buying. Generic IaC pitches focus on “consistency” and “reproducibility” — true but underwhelming. After three years of running these systems, the failure modes that have hurt me most are agent-shaped, and each one has a Terraform-shaped fix:

Infrastructure drift detection — when reality diverges from declared state

1. The 3am token leak. An agent loops on itself — bad stop condition, infinite tool retry, hallucinated planner state — and burns ¥40,000 of LLM budget overnight. The console-clicked stack has no programmatic budget guard because nobody wrote one. The Terraform stack has alicloud_log_alert provisioned from day one (article 7) because the module includes it by default. The cost of the alert is one extra resource in the plan; the cost of not having it is a phone call from finance.

2. The “who has my keys” panic. A contractor leaves. They had a copy of the OpenAI key they prototyped with. The console-clicked stack scatters that key in .env files on three ECS instances and a DingTalk DM. Rotating it is a half-day of grep -r OPENAI_API_KEY ~/projects and praying. The Terraform stack puts every key in KMS Secrets Manager (article 6), behind a RAM role, with one canonical location — rotation is terraform apply after editing secrets.auto.tfvars.

3. The phantom NAT charge. Agents are chatty; LLM calls are the chattiest. A misconfigured route makes them go through a public NAT in the wrong region and your egress bill triples. The console doesn’t tell you this — it just shows the line item next month. The Terraform alicloud_nat_gateway is in the same module as the alicloud_vswitch it serves, so the dependency is visible at plan time. Article 3 makes this the default.

4. The “what region is prod in?” question. Aliyun has 30+ regions. An engineer who joined two months ago, debugging an outage, doesn’t know whether the prod RDS is in cn-shanghai or cn-beijing and has to grep through DingTalk history. The Terraform answer is terraform output rds_endpoint. Resolved in five seconds, traceable forever.

5. The “we lost the LLM gateway config” weekend. The gateway routes models, holds quotas, logs traffic. It’s two ECS instances that someone configured by SSH-ing in and editing /etc/litellm/config.yaml. Server reboots wipe /etc/litellm. The Terraform version puts the config in cloud-init (article 6) — every instance comes up with the same config, no manual step.

None of these failure modes are exotic. Each has happened to a project I’ve worked on. Each is prevented, not just recovered from, by the IaC pattern. That’s the actual case for Terraform on an agent stack — not abstract reproducibility, but a concrete list of disasters that don’t occur.

Terraform vs Pulumi vs Crossplane vs ROS#

Granting the case for IaC, why Terraform specifically? A quick look at the alternatives. None are wrong; pick on team fit, not religion:

IaC tools compared

My honest read after using all four:

  • Terraform is the default. Largest ecosystem of providers, modules, and people who know it. HCL feels weird for the first day and fine after that. Pick this unless you have a strong reason not to.
  • Pulumi wraps Terraform providers but lets you write Python/TypeScript/Go. The expressiveness is real — you get loops, conditionals, and types your IDE actually checks. The cost is debugging: when something goes wrong, you’re now debugging through two layers (your code → Pulumi → TF provider). Worth it if your team genuinely hates HCL.
  • Crossplane lives in Kubernetes — every cloud resource becomes a CRD, and you kubectl apply your way to a VPC. Beautiful if you’re already a pure-Kubernetes shop with GitOps, painful if you aren’t.
  • ROS (Resource Orchestration Service) is Aliyun’s native equivalent. Deeply integrated with the console, JSON or YAML templates, no provider plug-in to install. Pick this only if you’re 100% on Aliyun forever and the ops team prefers a managed service.

The Aliyun docs themselves are fair on the comparison:

Both [Terraform and ROS] are declarative IaC tools. Terraform is an open source, third-party tool that supports multi-cloud management. ROS is a native Alibaba Cloud service deeply integrated with the Alibaba Cloud Management Console. Choose Terraform if you need multi-cloud support or already use Terraform elsewhere.

For an agent system that calls multiple LLM providers and might one day need a US region or a Singapore region, multi-cloud-friendly Terraform is the right default.

What this series will and won’t do#

What it will:

  • Take you from terraform init to a complete research-agent-stack running on Aliyun, in eight articles.
  • Show real, working HCL for VPC, ECS, ACK, OSS, RDS, OpenSearch, KMS, SLS, and CloudMonitor.
  • Cover the failure modes that are not in the docs — state drift, locked tfstate, GFW provider downloads, region stockouts.
  • Hand you a starter repo at the end you can fork.

What it won’t:

  • Teach you HCL syntax beyond what we use. The official HashiCorp tutorials do that better.
  • Teach you how to write the agent itself. There are series for LangGraph, AutoGen, MetaGPT, Claude Code already; pick one.
  • Compare Aliyun against AWS or GCP feature-by-feature. The IaC patterns translate across clouds; the resource names don’t.

What’s Next#

Article 2 is the first hands-on: installing the alicloud provider, picking your authentication method (the three choices — static AK/SK, AssumeRole, ECS RAM role — are not equivalent), setting up remote state on OSS with Tablestore for locking, and the workspace pattern for dev/staging/prod.

If you only do one thing today, install Terraform (brew install terraform on macOS, or follow the official Install Terraform topic) and run terraform version to confirm. The rest of the series assumes you have it.

Real-world tip: Pin the alicloud provider version in required_providers from day one, and pin Terraform itself with required_version. The provider is actively developed and breaking changes between minor versions are rare but not zero. A pinned version means your Friday terraform plan returns the same result on Monday.

In this series

Terraform Agents 8 parts

  1. 01 Terraform for AI Agents (1): Why IaC Is the Only Sane Way to Ship Agents you are here
  2. 02 Terraform for AI Agents (2): Provider, Auth, and Remote State on OSS
  3. 03 Terraform for AI Agents (3): A Reusable VPC and Security Baseline
  4. 04 Terraform for AI Agents (4): Compute — ECS, ACK, or Function Compute?
  5. 05 Terraform for AI Agents (5): Storage — Vector, Relational, and Object Memory
  6. 06 Terraform for AI Agents (6): LLM Gateway and Secrets Management
  7. 07 Terraform for AI Agents (7): Observability, SLS Dashboards, and Cost Alarms
  8. 08 Terraform for AI Agents (8): End-to-End — research-agent-stack in One Apply

Liked this piece?

Follow on GitHub for the next one — usually one a week.

GitHub