Series · Terraform Agents · Chapter 1

Terraform for AI Agents (1): Why IaC Is the Only Sane Way to Ship Agents

Agent systems are a moving target — new tools, new memory stores, new regions every month. Manual console clicks don't survive the second teammate. This first article makes the case for Terraform on Alibaba Cloud, surveys what the alicloud provider actually covers, and compares it to Pulumi, Crossplane, and ROS so you pick the right tool the first time.

I have shipped four agent systems on Alibaba Cloud in the last eighteen months. Three of them started life as a tmux session on a single ECS instance someone created by clicking through the console. All three of those needed a panicked weekend of rebuilding when the second engineer joined the project, when the prod region had a stockout, or when the security team asked for a network diagram.

The fourth started life as terraform apply. It is the only one I haven’t lost a weekend to.

This series is the field guide for that fourth pattern: how to use Terraform to provision the cloud infrastructure that an AI agent system actually needs on Alibaba Cloud. It is not a Terraform tutorial — there are good ones online and the official Get Started doc covers the basics. It is the senior-engineer playbook for the specific intersection of “I run agents” and “I run them on Aliyun”.

Eight articles. One real, working stack at the end. This first one is the why.

What “an agent system” actually requires

Before we talk infrastructure, let’s name the components an agent system has — the ones a pip install langgraph README usually skips:

  1. A runtime that holds the agent loop process — usually Python or Node — and survives restarts
  2. A vector store for semantic memory — embeddings of documents, prior conversations, tool outputs
  3. A relational store for session state — turn-by-turn conversation, tool-call traces, user identity
  4. An object store for artifacts — generated images, PDFs, screenshots, run snapshots
  5. An LLM gateway — one place that holds the API keys and enforces per-agent quotas
  6. Outbound network — to call DashScope, OpenAI, Anthropic, your scraping targets
  7. Observability — agent runs are non-deterministic, so logs and traces are not optional
  8. Secrets — provider keys, OAuth tokens, OSS credentials, database passwords
  9. Cost control — because token bills can 10x overnight when an agent loops on itself

That is at least nine separate Aliyun services touching each other in specific ways. Each has its own console page, its own RAM permissions, its own region scoping, its own networking. The probability that you can wire all of this up by hand and have it still match across dev, staging, and prod after three months of evolution is roughly zero.

The console-vs-IaC moment

The pain pattern is universal enough that I have a stock figure for it:

[Figure: Console clicks vs Terraform, where the divergence happens]

Read the left column carefully. Every step is plausible — none of them are dumb mistakes. They are what happens when smart people make small reasonable decisions over months. The right column is the same path, but every step leaves an artifact in git. The diff between the two columns is the difference between “I shipped this” and “I am paged at 2am because nobody knows what’s running in cn-beijing.”

The official Alibaba Cloud Terraform doc puts it more diplomatically. Quoting from the What Is Alibaba Cloud Terraform? topic:

Console operations: Click and enter parameters step by step. Repeat manual steps — hard to ensure consistency. Rely on documentation and verbal agreements.

Terraform: Describe the desired state of resources in configuration files. Configuration files are reviewable, shareable, and reusable. Store configuration files in version control. Changes are traceable and reversible.

That second paragraph is the entire pitch. Everything else in this series is implementation detail.

What Terraform actually is, in two sentences

Terraform is an open-source declarative tool from HashiCorp. You write .tf files in HashiCorp Configuration Language (HCL) that describe the cloud resources you want; Terraform compares that desired state against what it last recorded in a state file (refreshed against the live APIs) and emits a plan; you review the plan; you apply it; Terraform translates the plan into provider API calls.

Three things to internalize from that:

  • Declarative, not imperative. You don’t say “create an instance” — you say “an instance of this shape exists.” Re-running the same config is a no-op if nothing changed. This is what makes Terraform safe to run from CI on every commit.
  • State is real. The terraform.tfstate file is a JSON map from your HCL resource addresses to the cloud’s actual resource IDs. Lose the state file and Terraform thinks nothing exists. Article 2 is about putting state somewhere durable.
  • Plan before apply. This is the killer feature. Every change shows you a literal diff of what will be created, modified, or destroyed before anything happens. Cultivate the habit of pasting the plan output into PR descriptions; your future self will thank you.
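
To make the plan-before-apply habit concrete, here is a small sketch of how edits to a single resource show up in a plan. The bucket name and tag values are placeholders invented for illustration (OSS bucket names are globally unique, so this exact name will likely be taken); the action labels in the comments are the ones terraform plan prints.

```hcl
# Hypothetical artifact bucket; the name and tags are illustrative only.
resource "alicloud_oss_bucket" "artifacts" {
  bucket = "agents-artifacts-example"

  tags = {
    env = "dev"
  }
}

# How edits to this block appear in `terraform plan`:
#   +   create            first plan, before the bucket exists
#   ~   update in-place   after changing a tag value, e.g. env = "staging"
#   -/+ replace           after changing `bucket`, since a bucket cannot be renamed
#   -   destroy           after deleting this block from the config
# Re-running plan with no edits reports "No changes."
```

The replace case is the one worth reading for in every PR: a one-word rename in HCL can mean deleting and recreating a resource that holds data.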

What the Aliyun provider covers

Cloud platforms talk to Terraform through provider plug-ins. The official alicloud provider was the first official Terraform provider in China and is maintained by Alibaba. As of this writing it ships 300+ resource types across roughly six domains:

[Figure: alicloud provider coverage]

Per the official What Is Alibaba Cloud Terraform? page, supported categories include:

  • Compute and containers: ECS, ACK (Kubernetes), Function Compute, Auto Scaling
  • Networking: VPC, SLB, ALB, NLB, NAT Gateway, Cloud Enterprise Network
  • Storage and databases: OSS, NAS, ApsaraDB RDS, PolarDB, Redis, MongoDB
  • Security and management: RAM, KMS, WAF
  • Big data and AI: MaxCompute, PAI

That covers everything in our nine-component checklist above except observability (SLS, ARMS, CloudMonitor — also covered, just not on this short list) and the per-LLM-provider key bits (which we’ll handle through KMS Secrets Manager in article 6).

A minimal HCL example, straight from the official ECS practice doc, looks like this:

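# Region for every resource in this configuration.
provider "alicloud" {
  region = "cn-shanghai"
}

# The VPC is the root of the dependency graph.
resource "alicloud_vpc" "main" {
  vpc_name   = "agents-prod"
  cidr_block = "10.20.0.0/16"
}

# The vSwitch references the VPC ID, so Terraform creates the VPC first.
resource "alicloud_vswitch" "private_a" {
  vpc_id     = alicloud_vpc.main.id
  cidr_block = "10.20.1.0/24" # must fall inside the VPC CIDR
  zone_id    = "cn-shanghai-l"
}

# Note: newer provider releases mark `name` as deprecated in favor of
# `security_group_name`; check the docs for the version you pin.
resource "alicloud_security_group" "agent_runtime" {
  name   = "agent-runtime-sg"
  vpc_id = alicloud_vpc.main.id
}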

Three resources, with vpc_id references that Terraform resolves into the right dependency order automatically. You don’t say “first VPC, then vSwitch, then SG” — you write what you want and Terraform builds the DAG.
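
The same mechanism extends one level further. As a sketch, assuming the alicloud_security_group.agent_runtime resource from the example above, an egress rule for outbound HTTPS (the LLM API traffic from the checklist) only needs to reference the group's ID; the rule name and the wide-open destination CIDR are illustrative choices, not a recommendation.

```hcl
# Sketch: allow outbound HTTPS from the agent runtime security group.
# Depends on alicloud_security_group.agent_runtime defined in the example above.
resource "alicloud_security_group_rule" "agent_https_out" {
  type              = "egress"
  ip_protocol       = "tcp"
  port_range        = "443/443"
  cidr_ip           = "0.0.0.0/0" # illustrative; narrow this where you can
  policy            = "accept"
  priority          = 1
  nic_type          = "intranet" # VPC security groups use intranet
  security_group_id = alicloud_security_group.agent_runtime.id
}
```

Terraform now orders the work as VPC, then security group, then rule, purely from the two references; nothing in the config states the order.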

Modules: the unit of reuse

The single most important habit to build early is modules. A module is just a directory of .tf files that takes inputs and produces outputs. Once you have a working pattern — a VPC with three vSwitches, a NAT, and a security group baseline — wrap it in a module and you can stamp it out across dev, staging, prod, and intl-prod without copying HCL.

A bare-bones module call:

module "vpc" {
  source = "./modules/vpc-baseline"

  vpc_name   = "agents-${var.env}"
  cidr_block = "10.20.0.0/16"
  zones      = ["cn-shanghai-l", "cn-shanghai-m", "cn-shanghai-n"]
}

The body of ./modules/vpc-baseline/main.tf contains the actual alicloud_vpc, alicloud_vswitch, alicloud_nat_gateway resources. The caller doesn’t need to know — they just want a VPC with sane defaults. This is the same idea as a Python function, applied to infrastructure.
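
For orientation, a stripped-down version of that module body might look like the following. This is a sketch under my own assumptions, not the module article 3 builds: it covers only the VPC and one vSwitch per zone, leaves out the NAT gateway and security groups, and the /24-per-zone layout is an arbitrary choice.

```hcl
# modules/vpc-baseline/main.tf (sketch; NAT gateway and security groups omitted)

terraform {
  required_providers {
    alicloud = {
      source = "aliyun/alicloud"
    }
  }
}

variable "vpc_name" {
  type        = string
  description = "Name of the VPC."
}

variable "cidr_block" {
  type        = string
  description = "VPC CIDR, for example 10.20.0.0/16."
}

variable "zones" {
  type        = list(string)
  description = "Zone IDs; one vSwitch is created per zone."
}

resource "alicloud_vpc" "this" {
  vpc_name   = var.vpc_name
  cidr_block = var.cidr_block
}

# One subnet per zone, 8 bits longer than the VPC prefix.
# With a /16 VPC: first zone gets x.x.1.0/24, second x.x.2.0/24, and so on.
resource "alicloud_vswitch" "private" {
  for_each = { for idx, zone in var.zones : zone => idx }

  vpc_id     = alicloud_vpc.this.id
  zone_id    = each.key
  cidr_block = cidrsubnet(var.cidr_block, 8, each.value + 1)
}

output "vpc_id" {
  value = alicloud_vpc.this.id
}

output "vswitch_ids" {
  description = "Map of zone ID to vSwitch ID."
  value       = { for zone, vsw in alicloud_vswitch.private : zone => vsw.id }
}
```

The three variables match the arguments in the module call above, and the outputs are what a caller would read as module.vpc.vpc_id and module.vpc.vswitch_ids.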

We will build exactly this module in article 3 and reuse it in every subsequent article.

Terraform vs Pulumi vs Crossplane vs ROS

Before you commit, a quick look at the alternatives. None are wrong; pick on team fit, not religion:

[Figure: IaC tools compared]

My honest read after using all four:

  • Terraform is the default. Largest ecosystem of providers, modules, and people who know it. HCL feels weird for the first day and fine after that. Pick this unless you have a strong reason not to.
  • Pulumi wraps Terraform providers (its Alibaba Cloud package is bridged from the alicloud provider) but lets you write Python/TypeScript/Go. The expressiveness is real: you get loops, conditionals, and types your IDE actually checks. The cost is debugging: when something goes wrong, you’re now debugging through two layers (your code → Pulumi → TF provider). Worth it if your team genuinely hates HCL.
  • Crossplane lives in Kubernetes — every cloud resource becomes a CRD, and you kubectl apply your way to a VPC. Beautiful if you’re already a pure-Kubernetes shop with GitOps, painful if you aren’t.
  • ROS (Resource Orchestration Service) is Aliyun’s native equivalent. Deeply integrated with the console, JSON or YAML templates, no provider plug-in to install. Pick this only if you’re 100% on Aliyun forever and the ops team prefers a managed service.

The official Aliyun docs have a fair comparison in their FAQ:

Both [Terraform and ROS] are declarative IaC tools. Terraform is an open source, third-party tool that supports multi-cloud management. ROS is a native Alibaba Cloud service deeply integrated with the Alibaba Cloud Management Console. Choose Terraform if you need multi-cloud support or already use Terraform elsewhere.

For an agent system that calls multiple LLM providers and might one day need a US region or a Singapore region, multi-cloud-friendly Terraform is the right default.
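
Within Alibaba Cloud itself, a second region is just a second configuration of the same provider. A minimal sketch, assuming the ./modules/vpc-baseline module from the earlier call; the alias name, the module label, the CIDR, and the Singapore zone IDs are illustrative and should be checked against what your account can actually use.

```hcl
variable "env" {
  type        = string
  description = "Environment name, e.g. dev, staging, prod."
}

# Default provider configuration: the mainland region used in this article.
provider "alicloud" {
  region = "cn-shanghai"
}

# Second configuration of the same provider, for Singapore.
provider "alicloud" {
  alias  = "singapore"
  region = "ap-southeast-1"
}

# Hypothetical second instance of the VPC module, bound to the aliased provider.
module "vpc_intl" {
  source = "./modules/vpc-baseline"

  providers = {
    alicloud = alicloud.singapore
  }

  vpc_name   = "agents-${var.env}-intl"
  cidr_block = "10.30.0.0/16"
  zones      = ["ap-southeast-1a", "ap-southeast-1b", "ap-southeast-1c"]
}
```

Moving to a different cloud later follows the same shape with a different provider block, which is the practical meaning of multi-cloud-friendly here.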

What this series will and won’t do

What it will:

  • Take you from terraform init to a complete research-agent-stack running on Aliyun, in eight articles.
  • Show real, working HCL for VPC, ECS, ACK, OSS, RDS, OpenSearch, KMS, SLS, and CloudMonitor.
  • Cover the failure modes that are not in the docs — state drift, locked tfstate, GFW provider downloads, region stockouts.
  • Hand you a starter repo at the end you can fork.

What it won’t:

  • Teach you HCL syntax beyond what we use. The official HashiCorp tutorials do that better.
  • Teach you how to write the agent itself. There are series for LangGraph, AutoGen, MetaGPT, Claude Code already; pick one.
  • Compare Aliyun against AWS or GCP feature-by-feature. The IaC patterns translate across clouds; the resource names don’t.

What’s next

Article 2 is the first hands-on: installing the alicloud provider, picking your authentication method (the three choices — static AK/SK, AssumeRole, ECS RAM role — are not equivalent), setting up remote state on OSS with Tablestore for locking, and the workspace pattern for dev/staging/prod.
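
As a preview of where that lands, the remote-state piece is a single backend block. This is a sketch only: the bucket, prefix, Tablestore instance, and table names are placeholders, and both the bucket and the lock table have to exist before terraform init will succeed.

```hcl
terraform {
  backend "oss" {
    bucket = "agents-tfstate-example" # hypothetical bucket name
    prefix = "research-agent-stack"   # directory inside the bucket
    key    = "terraform.tfstate"
    region = "cn-shanghai"

    # Locking: a Tablestore table with a string primary key named LockID.
    # Instance name "tfstate-example" is a placeholder.
    tablestore_endpoint = "https://tfstate-example.cn-shanghai.ots.aliyuncs.com"
    tablestore_table    = "terraform_lock"
  }
}
```

Credentials are deliberately absent from the block; how they get supplied is exactly the authentication choice article 2 works through.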

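If you only do one thing today, install Terraform (on macOS, brew tap hashicorp/tap followed by brew install hashicorp/tap/terraform, since Homebrew's core terraform formula is frozen at an old release; elsewhere, follow the official Install Terraform topic) and run terraform version to confirm. The rest of the series assumes you have it.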

Real-world tip: Pin the alicloud provider version in required_providers from day one. The provider is actively developed and breaking changes between minor versions are rare but not zero. A pinned version means your Friday terraform plan returns the same result on Monday.
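
A minimal sketch of that pin follows. The version number is illustrative only; use whichever release you have actually tested against.

```hcl
terraform {
  required_providers {
    alicloud = {
      source  = "aliyun/alicloud"
      version = "1.230.0" # illustrative exact pin; bump deliberately, in its own PR
    }
  }
}
```

After terraform init, commit the generated .terraform.lock.hcl file as well. It records the selected version and its checksums, so teammates and CI resolve the same provider build that you did.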
