Terraform for AI Agents (4): Compute — ECS, ACK, or Function Compute?
The three places an agent's main loop can live on Aliyun: a long-running ECS instance with pm2, a Kubernetes pod on ACK, or a Function Compute invocation. The cost-crossover model I use to pick between them, and a real cloud-init bootstrap that goes from bare Ubuntu to running agent in 90 seconds.
The single most important architecture decision in an agent system is where the agent loop process actually runs. There are exactly three good answers on Aliyun. Picking the wrong one isn’t catastrophic — you can migrate later — but it costs you weeks of unnecessary scaffolding.
This article walks through all three with working Terraform, the cost crossover, and the operational gotchas.
The three patterns

Each has a sweet spot:
- ECS is a Linux VM. Long-lived, stateful, easy to SSH into when you’re debugging. The right answer for prototypes, single-tenant agents, and anywhere you want to keep one machine “warm” with cached models or local state.
- ACK (Container Service for Kubernetes) is the prod answer at scale. Multiple agent kinds, autoscaling, rolling deploys, GPU scheduling. Worth the operational weight only when you have at least three or four agent services and an SRE who’s comfortable in K8s.
- Function Compute is per-invocation, scale-to-zero. Cold start 200-800ms. Right for webhook-triggered agents, scheduled crawlers, and anything where the agent runs in bursts and idles otherwise.
The cost crossover
Here’s the rough monthly cost picture as a function of sustained QPS:

Under ~1 QPS sustained, FC dominates — you pay almost nothing during idle. From ~1 to ~30 QPS sustained, a single ECS box wins. Above that, ACK’s higher fixed cost is amortised over enough load to be cheaper than packing more onto ECS.
The model is rough — your actual numbers depend on instance family, network, and how chatty the agent is — but the shape is reliable. The decision rule I use:
- Bursty + low average → Function Compute
- Steady + low-to-mid → ECS with pm2
- Multi-agent + sustained mid-to-high → ACK
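To make the shape concrete, here's a toy version of that model. Every price and saturation figure below is an illustrative placeholder I've invented for the sketch, not an Aliyun list price; only the shape of the curves matters.

```python
import math

# Toy cost-crossover model. All prices and per-box QPS figures are
# illustrative placeholders, not Aliyun list prices.
SECONDS_PER_MONTH = 30 * 24 * 3600

def fc_monthly(qps, gb=1.0, secs_per_call=2.0, price_per_gb_s=0.0001):
    """Pay-per-use: cost scales linearly with invocation volume."""
    return qps * SECONDS_PER_MONTH * gb * secs_per_call * price_per_gb_s

def ecs_monthly(qps, qps_per_box=15.0, price_per_box=400.0):
    """Fixed cost per instance; a single box saturates fairly early."""
    return max(1, math.ceil(qps / qps_per_box)) * price_per_box

def ack_monthly(qps, qps_per_node=30.0, price_per_node=350.0, control_plane=600.0):
    """Higher fixed cost (control plane + minimum two nodes),
    cheaper marginal capacity thanks to bin-packing."""
    return control_plane + max(2, math.ceil(qps / qps_per_node)) * price_per_node

for qps in (0.1, 1, 10, 50):
    costs = {"FC": fc_monthly(qps), "ECS": ecs_monthly(qps), "ACK": ack_monthly(qps)}
    print(f"{qps:>5} QPS -> {min(costs, key=costs.get)}")
```

With these placeholder numbers the cheapest option flips from FC to ECS around 1 QPS and from ECS to ACK in the tens of QPS, matching the crossover described above.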
Pattern 1: ECS with pm2
For 80% of agent projects, this is what you want. One or two ECS instances behind an ALB, each running pm2 as the supervisor for the Python or Node agent process.
The official “Create an ECS instance” practice doc gives a working baseline. Adapted for our agent context:
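A sketch of what that adaptation can look like. The vswitch, security group, and `memory` CMK references are assumed to come from article 3, and the exact argument names should be checked against the alicloud provider docs:

```hcl
# Sketch only: resource names (private vswitch, agent SG, memory CMK)
# are assumptions carried over from article 3.
data "alicloud_images" "ubuntu" {
  name_regex  = "^ubuntu_22_04_x64"
  most_recent = true
  owners      = "system"
}

data "alicloud_instance_types" "agent" {
  instance_type_family = "ecs.c7"
  cpu_core_count       = 4
  memory_size          = 16
}

resource "alicloud_instance" "agent" {
  instance_name   = "agent-runtime"
  image_id        = data.alicloud_images.ubuntu.images[0].id
  instance_type   = data.alicloud_instance_types.agent.instance_types[0].id
  vswitch_id      = alicloud_vswitch.private.id
  security_groups = [alicloud_security_group.agent.id]

  system_disk_category   = "cloud_essd"
  system_disk_encrypted  = true
  system_disk_kms_key_id = alicloud_kms_key.memory.id

  # Bootstrap script, rendered with whatever variables the agent needs
  user_data = templatefile("${path.module}/cloud-init.sh", {
    agent_version = var.agent_version
  })

  lifecycle {
    create_before_destroy = true
  }
}
```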
Three things worth highlighting:
- `data` blocks pick the image and instance type instead of hardcoding them. `ubuntu_22_04_x64` resolves to the latest patched image; `data.alicloud_instance_types.agent` finds an `ecs.c7` with 4 vCPUs and 16 GiB. When Aliyun deprecates an image SKU, your next plan picks the new one automatically.
- `system_disk_kms_key_id` ties the disk to the `memory` CMK from article 3. Encryption-at-rest costs nothing extra and removes a whole compliance headache.
- `lifecycle { create_before_destroy = true }` means a planned replace creates the new instance, attaches it to the ALB, drains the old one, then destroys it: zero-downtime rotation. The trade-off is that you briefly need 2× capacity, which is fine for two-instance fleets and starts to matter at 50.
The cloud-init bootstrap
`cloud-init.sh` is a `templatefile` that boots the box from bare Ubuntu to a running agent:
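A sketch of such a bootstrap script. The repo URL, service name, and entrypoint are placeholders; adapt them to your agent:

```bash
#!/usr/bin/env bash
# Sketch of the bootstrap described in the text. Repo URL, user name,
# and agent entrypoint are assumptions.
set -euxo pipefail   # loud, traceable failures in cloud-init-output.log

# 1. System packages: the slow (~60s) step worth baking into a Packer image
apt-get update -y
apt-get install -y python3 python3-venv git curl

# 2. Node + pm2 as the process supervisor
curl -fsSL https://deb.nodesource.com/setup_20.x | bash -
apt-get install -y nodejs
npm install -g pm2

# 3. Fetch and install the agent under a dedicated user
useradd --create-home agent || true
sudo -u agent git clone https://example.com/agent.git /home/agent/app
cd /home/agent/app
sudo -u agent python3 -m venv .venv
sudo -u agent .venv/bin/pip install -r requirements.txt

# 4. Start under pm2 and persist the process list across reboots
sudo -u agent pm2 start ".venv/bin/python -m agent.main" --name agent
pm2 startup systemd -u agent --hp /home/agent
sudo -u agent pm2 save
```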
The flow: `apt-get` the runtime dependencies, install pm2, pull the agent code, then start it under pm2 and persist the process list. About 90 seconds from `terraform apply` to `pm2 status` showing the agent as online. The first `apt-get install` is the slow step (~60s). Once you have a stable image, bake it with Packer so future ECS instances skip apt entirely and boot in 25 seconds.
Real-world tip: `user_data` is logged to `/var/log/cloud-init-output.log` on the instance. When an agent doesn't come up, that's where you look first. Add `set -euxo pipefail` at the top so failures are loud and traceable.
Pattern 2: ACK for production fleets
Once you have three or more agent kinds running side by side, the per-VM operational cost dominates. ACK gives you one cluster, one scheduler, one upgrade path.
The minimal Terraform to get a managed K8s cluster on Aliyun:
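A sketch of that minimal cluster; the vswitch references and CIDR are assumptions, and argument names should be verified against the alicloud provider docs:

```hcl
# Sketch only: vswitch names and service CIDR are assumptions.
resource "alicloud_cs_managed_kubernetes" "agents" {
  name_prefix         = "agent-fleet"
  cluster_spec        = "ack.pro.small"          # managed control plane
  worker_vswitch_ids  = [alicloud_vswitch.private.id]
  pod_vswitch_ids     = [alicloud_vswitch.pods.id]  # Terway: pods get real VPC IPs
  service_cidr        = "172.16.0.0/16"
  new_nat_gateway     = false
  deletion_protection = true

  addons {
    name = "arms-prometheus"   # metrics, wired up in article 7
  }
  addons {
    name = "logtail-ds"        # SLS log collector
  }
}

# Worker capacity comes from a separate node pool resource
# (alicloud_cs_kubernetes_node_pool), omitted here for brevity.
```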
A few notes:
- `ack.pro.small` is the managed control-plane SKU. Aliyun runs the masters; you only pay for the worker ECS. Don't pick the unmanaged SKU unless you have a strong reason.
- `pod_vswitch_ids` is for Terway, the Aliyun-native CNI. Each pod gets a real VPC IP: no overlay network, and security groups apply directly. This is the right default; the alternative (Flannel) makes networking debugging miserable.
- `deletion_protection = true` does what it says: `terraform destroy` won't kill the cluster. Set this on every prod cluster.
- The `addons` block enables ARMS Prometheus (article 7) and the SLS log collector. Provisioning these via Terraform means new clusters come pre-instrumented.
The actual agent pods come from a Kubernetes Deployment manifest, usually applied by a separate `kubectl` step or via the `kubernetes` Terraform provider. I keep the cluster in this Terraform project and the workloads in a separate Helm chart, because they have different release cadences.
Pattern 3: Function Compute for event-driven agents
Some agents only run when triggered — a webhook fires, a cron tick happens, an OSS object lands. For those, FC is unbeatable: zero idle cost, automatic scale-out, and the cloud handles the runtime entirely.
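A sketch of such a function. The names, the OSS code bucket, and the runtime string are assumptions (check which Python runtimes FC currently offers), and the trigger config format should be verified against the FC timer docs:

```hcl
# Sketch only: service/function names and the code bucket are assumptions.
resource "alicloud_fc_service" "agents" {
  name = "agent-crons"

  # Same VPC + security group as the rest of the stack (article 3)
  vpc_config {
    security_group_id = alicloud_security_group.agent.id
    vswitch_ids       = [alicloud_vswitch.private.id]
  }
}

resource "alicloud_fc_function" "daily_crawler" {
  service     = alicloud_fc_service.agents.name
  name        = "daily-crawler"
  runtime     = "python3.10"   # assumption: pick the closest available runtime
  handler     = "index.handler"
  memory_size = 1024           # 1 GiB
  timeout     = 600            # 10 minutes

  oss_bucket = alicloud_oss_bucket.artifacts.id
  oss_key    = "fc/daily-crawler.zip"
}

resource "alicloud_fc_trigger" "daily_9am" {
  service  = alicloud_fc_service.agents.name
  function = alicloud_fc_function.daily_crawler.name
  name     = "daily-9am"
  type     = "timer"
  config = jsonencode({
    cronExpression = "CRON_TZ=Asia/Shanghai 0 0 9 * * *"
    enable         = true
  })
}
```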
What this gives you: a Python 3.11 function, 1 GiB RAM, 10-minute timeout, attached to the same VPC and security group as the rest of your stack, triggered every day at 9am. Zero servers to maintain. Cost: roughly ¥0.10 per invocation at this size, plus ¥0.0001 per GB-second of execution.
Three caveats I keep tripping on:
- Cold start. First invocation after idle takes 200-800ms more than subsequent ones. For a webhook with a sub-second SLA, this matters; for a cron task, it doesn't. Provisioned concurrency exists but defeats the point of FC.
- VPC attachment adds another 200-400ms to cold starts because FC has to attach an ENI to your VPC. Worth it if the function needs to reach RDS/OpenSearch; skip the `vpc_config` block if it only calls public APIs.
- 24-hour max runtime. For long agent loops, FC is a bad fit. Either chunk the loop into shorter steps or use ECS.
A real example: hybrid
Most production agent stacks I’ve shipped end up hybrid:
- ECS for the always-on conversational agent that holds session state
- ACK for the worker fleet that processes background jobs
- FC for webhook receivers and daily/hourly cron tasks
Terraform makes this trivial — three modules in the same project, sharing the VPC and security groups from article 3. The skill is knowing which pattern fits which workload, not learning all the resource syntaxes.
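The top level of such a hybrid project can be as small as three module calls sharing the network outputs. A sketch, where the module paths and the `network` module's output names are assumptions:

```hcl
# Sketch: module paths and the network module's outputs are assumptions.
module "chat_agent_ecs" {
  source            = "./modules/ecs-agent"   # Pattern 1: always-on, stateful
  vpc_id            = module.network.vpc_id
  security_group_id = module.network.agent_sg_id
}

module "worker_fleet_ack" {
  source          = "./modules/ack-fleet"     # Pattern 2: background workers
  vpc_id          = module.network.vpc_id
  pod_vswitch_ids = module.network.pod_vswitch_ids
}

module "cron_agents_fc" {
  source            = "./modules/fc-crons"    # Pattern 3: webhooks + cron
  vpc_id            = module.network.vpc_id
  security_group_id = module.network.agent_sg_id
}
```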
Right-sizing the instance
A common question: which `ecs.*` family for an agent runtime? My defaults:
| Workload | Family | Why |
|---|---|---|
| Conversational agent, no GPU | ecs.c7 | CPU-bound on tokenisation + I/O on LLM calls |
| Memory-heavy (large context) | ecs.r7 | More RAM per vCPU |
| Batch / scheduled with bursts | ecs.c7a (AMD) | ~15% cheaper, slightly slower per core |
| GPU inference of small models | ecs.gn7i | T4-class, cheapest GPU on Aliyun |
| Pretraining / large fine-tune | Use PAI-DLC, not ECS | Don’t reinvent the orchestration |
Avoid the burstable `ecs.t6` family for the agent runtime: CPU credits run out under sustained load and your latency goes off a cliff. They're fine for the bastion that runs `terraform apply` and not much else.
Real-world tip: use `data.alicloud_instance_types` to ask the API for "give me a 4-vCPU 16-GiB instance available in this zone right now". Hardcoding `ecs.c7.xlarge` works until that exact SKU is out of stock in your zone, at which point Terraform fails. Letting the data source pick gives you graceful fallback.
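A minimal sketch of that lookup, assuming a `default` zones data source is already defined:

```hcl
# Sketch: ask the API for any in-stock 4-vCPU / 16-GiB type in this zone,
# rather than pinning one SKU.
data "alicloud_instance_types" "agent" {
  availability_zone = data.alicloud_zones.default.zones[0].id
  cpu_core_count    = 4
  memory_size       = 16
}

resource "alicloud_instance" "agent" {
  # First match wins; when a SKU sells out, the next plan picks another.
  instance_type = data.alicloud_instance_types.agent.instance_types[0].id
  # ... remaining arguments as in Pattern 1
}
```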
What’s next
Article 5 fills in the storage layer — vector store, relational, object store, backups — that everything we just provisioned needs to talk to. ECS instances are useless until they have somewhere to put memory.
Then article 6 builds the LLM gateway in front of all the compute, article 7 wires observability and cost alarms, and article 8 stitches everything into one terraform apply.