Terraform for AI Agents (3): A Reusable VPC and Security Baseline
The first reusable module — a three-zone VPC with public/private subnets, NAT egress, security groups layered by tier, and KMS keys per data domain. The same code shows up in every agent stack I've shipped, parameterised but otherwise unchanged.
This article builds the single most copied piece of Terraform in my agent projects: a vpc-baseline module that gives every later component (ECS, RDS, OpenSearch, ACK) a sane place to land.
By the end you’ll have:
- A VPC across three availability zones in one region
- Six subnets (one public + one private per zone) with non-overlapping CIDRs
- A NAT gateway with EIP for private-subnet outbound to LLM APIs
- Three security groups stacked by tier (ALB → agent runtime → memory)
- Three KMS customer master keys, one per data domain (memory, secrets, logs)
- A clean module interface: name + CIDR + zones in, IDs out
It’s about 200 lines of HCL all-in: type it once, refer to it forever.
The mental model
Before code, the picture: three availability zones, each with one public and one private subnet; an ALB and a NAT gateway in the public subnets; the agent runtime and its data stores in the private ones.
Why three zones? Because Aliyun reserves the right to do zone-level maintenance on any given Sunday, and a single-zone deployment means your agents are offline for the whole window. Cross-zone traffic inside a VPC is free; the only cost of three zones is the operational complexity of subnet math.
Why public + private? The agent runtime should live in private subnets so a misconfigured security group can’t accidentally expose it on 0.0.0.0/0. Public subnets hold the ALB (load balancer) and the NAT gateway — things that must reach the internet. The agent reaches the internet via NAT, not directly.
The CIDR layout I use:
| Subnet | Zone | CIDR | Hosts |
|---|---|---|---|
| public-a | l | 10.20.0.0/28 | 11 |
| public-b | m | 10.20.0.16/28 | 11 |
| public-c | n | 10.20.0.32/28 | 11 |
| private-a | l | 10.20.1.0/24 | 251 |
| private-b | m | 10.20.2.0/24 | 251 |
| private-c | n | 10.20.3.0/24 | 251 |
Public subnets are /28 because they only hold a NAT and an ALB IP. Private subnets are /24 because that’s where the agent ECS, RDS, OpenSearch nodes live.
The module skeleton
Create the directory layout:
modules/vpc-baseline/
├── main.tf
├── variables.tf
├── outputs.tf
└── versions.tf
Inputs (variables.tf):
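One way to express that interface is sketched below; the variable names are illustrative, not prescriptive:

```hcl
variable "name" {
  description = "Prefix for all resource names, e.g. agents-prod"
  type        = string
}

variable "cidr_block" {
  description = "VPC CIDR, e.g. 10.20.0.0/16"
  type        = string
}

variable "zones" {
  description = "Exactly three availability zone IDs"
  type        = list(string)

  validation {
    condition     = length(var.zones) == 3
    error_message = "This module is built for exactly three zones."
  }
}
```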
Forcing exactly three zones is opinionated but matches the diagram. If you need two-zone or four-zone, fork the module — don’t make it conditional. Conditional modules become unreadable.
The VPC and subnets
main.tf, part one:
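A sketch of that first part, assuming recent alicloud provider resource names (`alicloud_vpc`, `alicloud_vswitch`); attribute names such as `vswitch_name` and `zone_id` vary slightly between provider versions:

```hcl
locals {
  # Index/value map: { "0" = first zone, "1" = second zone, "2" = third zone }
  zone_map = { for idx, z in var.zones : tostring(idx) => z }
}

resource "alicloud_vpc" "this" {
  vpc_name   = var.name
  cidr_block = var.cidr_block
}

resource "alicloud_vswitch" "public" {
  for_each = local.zone_map

  vswitch_name = "${var.name}-public-${substr(each.value, -1, 1)}"
  vpc_id       = alicloud_vpc.this.id
  zone_id      = each.value
  # /16 + 12 new bits = /28: 10.20.0.0/28, 10.20.0.16/28, 10.20.0.32/28
  cidr_block   = cidrsubnet(var.cidr_block, 12, tonumber(each.key))
}

resource "alicloud_vswitch" "private" {
  for_each = local.zone_map

  vswitch_name = "${var.name}-private-${substr(each.value, -1, 1)}"
  vpc_id       = alicloud_vpc.this.id
  zone_id      = each.value
  # /16 + 8 new bits = /24: 10.20.1.0/24, 10.20.2.0/24, 10.20.3.0/24
  cidr_block   = cidrsubnet(var.cidr_block, 8, tonumber(each.key) + 1)
}
```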
Three things worth noting:
1. `cidrsubnet(prefix, newbits, netnum)` is Terraform’s CIDR math. `cidrsubnet("10.20.0.0/16", 8, 1)` returns `"10.20.1.0/24"`. Memorise this — you’ll use it constantly.
2. `for_each` with the index/value map gives stable resource addresses — `alicloud_vswitch.private["0"]` always points to the first zone, even if you rearrange the list. Compare to `count`, where reordering causes wholesale recreation.
3. `substr(each.value, -1, 1)` extracts the last char of the zone ID (the `l`/`m`/`n`) so resource names sort nicely.
NAT gateway and EIP
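A minimal sketch of the NAT wiring, assuming the alicloud provider’s `alicloud_nat_gateway`, `alicloud_eip_address`, and `alicloud_snat_entry` resources (exact argument names depend on provider version):

```hcl
resource "alicloud_nat_gateway" "this" {
  vpc_id           = alicloud_vpc.this.id
  nat_gateway_name = "${var.name}-nat"
  nat_type         = "Enhanced"
  payment_type     = "PayAsYouGo"
  # Enhanced NAT must itself sit in a vSwitch; a public one works fine.
  vswitch_id       = alicloud_vswitch.public["0"].id
}

resource "alicloud_eip_address" "nat" {
  address_name         = "${var.name}-nat-eip"
  internet_charge_type = "PayByTraffic"
  payment_type         = "PayAsYouGo"
}

resource "alicloud_eip_association" "nat" {
  allocation_id = alicloud_eip_address.nat.id
  instance_id   = alicloud_nat_gateway.this.id
  instance_type = "Nat"
}

# One SNAT entry per private vSwitch: this is what gives those
# subnets a route to the internet through the NAT's EIP.
resource "alicloud_snat_entry" "private" {
  for_each = alicloud_vswitch.private

  snat_table_id     = alicloud_nat_gateway.this.snat_table_ids
  source_vswitch_id = each.value.id
  snat_ip           = alicloud_eip_address.nat.ip_address
}
```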
The Enhanced NAT type is the modern one — required for Tablestore, PrivateLink, and most newer services. PayByTraffic is right for agent workloads where outbound bandwidth is bursty (LLM streaming) rather than steady.
The SNAT entries are what actually let private-subnet instances reach the internet. Without them, an agent in private-a cannot resolve dashscope.aliyuncs.com.
Security groups, layered
The right way to do security groups on Aliyun is one SG per tier, with rules that reference SG IDs, not CIDRs:

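Sketched in HCL, for the first two tiers; `security_group_name` is the newer argument name (older provider versions use `name`), and the ports shown are illustrative:

```hcl
resource "alicloud_security_group" "alb_public" {
  security_group_name = "${var.name}-alb"
  vpc_id              = alicloud_vpc.this.id
}

# The ALB is the only thing that accepts traffic from the internet.
resource "alicloud_security_group_rule" "alb_https_in" {
  type              = "ingress"
  ip_protocol       = "tcp"
  port_range        = "443/443"
  cidr_ip           = "0.0.0.0/0"
  security_group_id = alicloud_security_group.alb_public.id
}

resource "alicloud_security_group" "agent_runtime" {
  security_group_name = "${var.name}-agent-runtime"
  vpc_id              = alicloud_vpc.this.id
}

# Inbound 8080 only from members of the ALB SG — no CIDR anywhere.
resource "alicloud_security_group_rule" "agent_from_alb" {
  type                     = "ingress"
  ip_protocol              = "tcp"
  port_range               = "8080/8080"
  security_group_id        = alicloud_security_group.agent_runtime.id
  source_security_group_id = alicloud_security_group.alb_public.id
}
```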
The key line is `source_security_group_id = alicloud_security_group.alb_public.id`. This says “accept inbound 8080 only from instances in the ALB SG” — not from a CIDR. Re-IPing the ALB later doesn’t break anything.
Real-world tip: Aliyun’s default behaviour is to deny all ingress and allow all egress. The default is correct — don’t add a “deny all egress” rule, you’ll just break SDK calls. Limit egress only when you have a specific compliance requirement; for an agent system, all-egress-open is normal.
I extend this pattern for every downstream tier:
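The memory tier follows the same shape; a sketch, with a Postgres-style port as a stand-in for whatever the data store actually listens on:

```hcl
resource "alicloud_security_group" "memory" {
  security_group_name = "${var.name}-memory"
  vpc_id              = alicloud_vpc.this.id
}

# The memory tier (RDS/OpenSearch) accepts traffic only from
# members of the agent runtime SG, one layer up.
resource "alicloud_security_group_rule" "memory_from_agent" {
  type                     = "ingress"
  ip_protocol              = "tcp"
  port_range               = "5432/5432"
  security_group_id        = alicloud_security_group.memory.id
  source_security_group_id = alicloud_security_group.agent_runtime.id
}
```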
By the time you’re done, attaching an ECS to the right SG is just `security_groups = [module.vpc.agent_runtime_sg_id]` and the network tier is correct by construction.
KMS keys per data domain
Encryption-at-rest is mandatory for any compliance regime worth its salt. The Aliyun way is one Customer Master Key (CMK) per data domain, so you can rotate one without touching another and audit access per-key.

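A sketch of the per-domain keys and aliases, assuming the provider’s `alicloud_kms_key` and `alicloud_kms_alias` resources:

```hcl
resource "alicloud_kms_key" "this" {
  for_each = toset(["memory", "secrets", "logs"])

  description            = "${var.name}-${each.key} data domain"
  pending_window_in_days = 7
  automatic_rotation     = "Enabled"
}

# Stable, human-readable name per key; downstream services
# reference the alias, not the underlying UUID.
resource "alicloud_kms_alias" "this" {
  for_each = alicloud_kms_key.this

  alias_name = "alias/${var.name}-${each.key}"
  key_id     = each.value.id
}
```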
Why the alias? Because the CMK ID is a UUID nobody remembers; the alias alias/agents-prod-memory is human-readable and stable across key rotations. Reference the alias from RDS, OSS, etc. and you can swap the underlying key without touching downstream config.
pending_window_in_days = 7 means a deleted key has a 7-day window where you can recover it. Don’t shorten this — accidental key deletion is the kind of mistake that ends careers.
The module outputs
outputs.tf:
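The output names below match the plan output shown later in the article; the rest are plausible companions:

```hcl
output "vpc_id" {
  value = alicloud_vpc.this.id
}

output "public_vswitch_ids" {
  value = [for v in alicloud_vswitch.public : v.id]
}

output "private_vswitch_ids" {
  value = [for v in alicloud_vswitch.private : v.id]
}

output "alb_sg_id" {
  value = alicloud_security_group.alb_public.id
}

output "agent_runtime_sg_id" {
  value = alicloud_security_group.agent_runtime.id
}

output "memory_sg_id" {
  value = alicloud_security_group.memory.id
}

output "kms_key_aliases" {
  value = { for k, a in alicloud_kms_alias.this : k => a.alias_name }
}

output "nat_eip_address" {
  value = alicloud_eip_address.nat.ip_address
}
```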
These are exactly the IDs the next five articles will need. By naming and shaping outputs deliberately, callers can do:
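For example, dropping an agent ECS instance into the baseline network (the non-network arguments are elided):

```hcl
resource "alicloud_instance" "agent" {
  # image_id, instance_type, etc. elided

  vswitch_id      = module.vpc.private_vswitch_ids[0]
  security_groups = [module.vpc.agent_runtime_sg_id]
}
```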
Calling the module
In your top-level main.tf:
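A sketch of the module call, using the cn-shanghai zones from the CIDR table above; the module name `vpc` is an assumption:

```hcl
module "vpc" {
  source = "./modules/vpc-baseline"

  name       = "agents-prod"
  cidr_block = "10.20.0.0/16"
  zones      = ["cn-shanghai-l", "cn-shanghai-m", "cn-shanghai-n"]
}
```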
terraform plan from the project root will produce something like:
Plan: 27 to add, 0 to change, 0 to destroy.
Changes to Outputs:
+ agent_runtime_sg_id = (known after apply)
+ nat_eip_address = (known after apply)
+ private_vswitch_ids = [
+ (known after apply),
+ (known after apply),
+ (known after apply),
]
+ vpc_id = (known after apply)
27 resources is about right (1 VPC + 6 vSwitch + 1 NAT + 1 EIP + 1 EIP-assoc + 3 SNAT + 4 SG + 4 SG-rule + 3 KMS key + 3 KMS alias = 27). Apply, and you have a production-grade network in about 90 seconds.
What it costs
Roughly, in cn-shanghai, monthly:
- VPC, vSwitch, security groups, KMS keys: free
- NAT gateway: ~¥120/mo for the Enhanced type, plus per-GB egress
- EIP: ~¥20/mo for IP reservation, plus PayByTraffic data
- KMS: free for the first 100 calls/day per key, then ~¥0.005/call
Call it ¥150-300/month for the network baseline at low-to-moderate traffic. Cheap for what you get — every later article inherits this skeleton.
What’s next
Article 4 lands compute on this network. Three patterns — ECS with pm2, ACK for production fleets, Function Compute for event-driven agents — and the cost-crossover model I use to pick between them. Then a real alicloud_instance block that bootstraps Python + Node + the agent runtime via cloud-init.
Real-world tip: If you ever need to add a fourth zone (Aliyun adds them periodically), it is a `terraform apply` away — the `for_each` pattern handles a longer list cleanly. But: the `validation` block in `variables.tf` will reject it, so you’ll first relax the validation. That deliberate friction is the point — adding a zone is a network change worth thinking about.