Series · Terraform Agents · Chapter 4

Terraform for AI Agents (4): Compute — ECS, ACK, or Function Compute?

The three places an agent's main loop can live on Aliyun: a long-running ECS instance with pm2, a Kubernetes pod on ACK, or a Function Compute invocation. The cost-crossover model I use to pick between them, and a real cloud-init bootstrap that goes from bare Ubuntu to running agent in 90 seconds.

The single most important architecture decision in an agent system is where the agent loop process actually runs. There are exactly three good answers on Aliyun. Picking the wrong one isn’t catastrophic — you can migrate later — but it costs you weeks of unnecessary scaffolding.

This article walks through all three with working Terraform, the cost crossover, and the operational gotchas.

The three patterns

Three places to run an agent: ECS, ACK, FC

Each has a sweet spot:

  • ECS is a Linux VM. Long-lived, stateful, easy to SSH into when you’re debugging. The right answer for prototypes, single-tenant agents, and anywhere you want to keep one machine “warm” with cached models or local state.
  • ACK (Container Service for Kubernetes) is the prod answer at scale. Multiple agent kinds, autoscaling, rolling deploys, GPU scheduling. Worth the operational weight only when you have at least three or four agent services and an SRE who’s comfortable in K8s.
  • Function Compute is per-invocation, scale-to-zero. Cold start 200-800ms. Right for webhook-triggered agents, scheduled crawlers, and anything where the agent runs in bursts and idles otherwise.

The cost crossover

Here’s the rough monthly cost picture as a function of sustained QPS:

Compute cost crossover — rough model

Under ~1 QPS sustained, FC dominates — you pay almost nothing during idle. From ~1 to ~30 QPS sustained, a single ECS box wins. Above that, ACK’s higher fixed cost is amortised over enough load to be cheaper than packing more onto ECS.

The model is rough — your actual numbers depend on instance family, network, and how chatty the agent is — but the shape is reliable. The decision rule I use:

  • Bursty + low average → Function Compute
  • Steady + low-to-mid → ECS with pm2
  • Multi-agent + sustained mid-to-high → ACK
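The crossover arithmetic is simple enough to sketch as Terraform locals. Every number below is a placeholder assumption, not an Aliyun list price; substitute your real instance cost and FC bill:

```hcl
# Back-of-envelope crossover model. All prices and capacities below are
# assumed placeholders; only the shape of the comparison is the point.
locals {
  qps             = 5         # sustained requests per second
  secs_per_month  = 2592000   # 30 days
  fc_run_seconds  = 2         # assumed average agent run at 1 GiB
  fc_gb_second    = 0.0001    # CNY per GB-second
  ecs_box_month   = 450       # CNY for one 4c16g box, assumed
  ecs_qps_per_box = 15        # assumed capacity of one box
  ack_fixed       = 600       # CNY for control plane + SLB, assumed
  ack_qps_per_box = 40        # assumed: better bin-packing than raw ECS

  fc_monthly  = local.qps * local.secs_per_month * local.fc_run_seconds * local.fc_gb_second
  ecs_monthly = local.ecs_box_month * ceil(local.qps / local.ecs_qps_per_box)
  ack_monthly = local.ack_fixed + local.ecs_box_month * ceil(local.qps / local.ack_qps_per_box)
}

output "monthly_cny" {
  value = {
    fc  = local.fc_monthly   # 2592 at 5 QPS: FC has already lost
    ecs = local.ecs_monthly  # 450
    ack = local.ack_monthly  # 1050
  }
}
```

Drop `qps` to 0.5 and FC falls under the ECS line; push it past ~45 and ACK's packing advantage overtakes a fleet of single boxes, which matches the shape described above.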

Pattern 1: ECS with pm2

For 80% of agent projects, this is what you want. One or two ECS instances behind an ALB, each running pm2 as the supervisor for the Python or Node agent process.

The official “Create an ECS instance” practice doc gives a working baseline. Adapted for our agent context:

data "alicloud_images" "ubuntu" {
  owners      = "system"
  name_regex  = "^ubuntu_22_04_x64.*"
  most_recent = true
}

data "alicloud_instance_types" "agent" {
  cpu_core_count       = 4
  memory_size          = 16
  availability_zone    = "cn-shanghai-l"
  instance_type_family = "ecs.c7"
}

resource "alicloud_instance" "agent" {
  count = var.agent_count

  instance_name        = "agent-${terraform.workspace}-${count.index + 1}"
  image_id             = data.alicloud_images.ubuntu.images[0].id
  instance_type        = data.alicloud_instance_types.agent.instance_types[0].id
  availability_zone    = "cn-shanghai-l"

  vswitch_id      = module.vpc.private_vswitch_ids[count.index % 3]
  security_groups = [module.vpc.agent_runtime_sg_id]

  system_disk_category   = "cloud_essd"
  system_disk_size       = 80
  system_disk_encrypted  = true
  system_disk_kms_key_id = module.vpc.kms_keys["memory"]

  user_data = base64encode(templatefile("${path.module}/cloud-init.sh", {
    repo_url       = var.agent_repo_url
    branch         = var.agent_branch
    gateway_url    = "http://${alicloud_alb_load_balancer.gateway.dns_name}"
    sls_project    = alicloud_log_project.agents.name
    sls_logstore   = alicloud_log_store.agent_runs.name
  }))

  tags = {
    Role = "agent-runtime"
    App  = "research-agent"
  }

  lifecycle {
    create_before_destroy = true
    ignore_changes = [user_data]    # don't replace on every cloud-init bump
  }
}

Three things worth highlighting:

  1. data blocks pick the image and instance type instead of hardcoding them. ubuntu_22_04_x64 resolves to the latest patched image; data.alicloud_instance_types.agent finds an ecs.c7 with 4 vCPUs and 16 GiB. When Aliyun deprecates an image SKU, your next plan picks the new one automatically.
  2. system_disk_kms_key_id ties the disk to the memory CMK from article 3. Encryption-at-rest costs nothing extra and removes a whole compliance headache.
  3. lifecycle { create_before_destroy = true } means a planned replace creates the new instance, attaches it to the ALB, drains the old, then destroys it — zero-downtime rotation. The trade-off is you briefly need 2× capacity, which is fine for two-instance fleets and starts to matter at 50.

The cloud-init bootstrap

cloud-init.sh is a templatefile that boots the box from bare Ubuntu to running agent:

#!/bin/bash
set -euxo pipefail

# Update apt and install base deps
apt-get update -y
apt-get install -y python3.11 python3.11-venv git curl ca-certificates

# Node 20 for any JS tooling the agent shells out to
curl -fsSL https://deb.nodesource.com/setup_20.x | bash -
apt-get install -y nodejs

# pm2 as the process supervisor
npm install -g pm2

# uv via the official installer (it is a Python tool, not an npm package)
curl -LsSf https://astral.sh/uv/install.sh | sh
export PATH=/root/.local/bin:$PATH

# Clone the agent runtime
mkdir -p /opt/agent
cd /opt/agent
git clone --depth 1 -b ${branch} ${repo_url} src
cd src
uv venv .venv
uv pip sync requirements.txt

# Wire env vars (these come from Terraform via the templatefile)
cat > /opt/agent/src/.env <<EOF
LLM_GATEWAY_URL=${gateway_url}
SLS_PROJECT=${sls_project}
SLS_LOGSTORE=${sls_logstore}
EOF

# Start under pm2 and persist
pm2 start ecosystem.config.js
pm2 save
pm2 startup systemd -u root --hp /root

The flow:

Cloud-init bootstrap flow

About 90 seconds from apply to pm2 status showing the agent as online. The first apt-get install is the slow step (~60s). Once you have a stable image, bake it with Packer so future ECS instances skip apt entirely and boot in 25 seconds.
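The bake can live in the same repo. A minimal Packer sketch, where the plugin version, the pinned source image name, and the `cloud-init-base.sh` split of the bootstrap are all assumptions to adapt:

```hcl
packer {
  required_plugins {
    alicloud = {
      source  = "github.com/hashicorp/alicloud"
      version = ">= 1.1.0"
    }
  }
}

source "alicloud-ecs" "agent_base" {
  region        = "cn-shanghai"
  image_name    = "agent-base-{{timestamp}}"
  source_image  = "ubuntu_22_04_x64_20G_alibase_20240220.vhd"  # pin a known-good base (assumed name)
  instance_type = "ecs.c7.large"
  ssh_username  = "root"
  io_optimized  = true
}

build {
  sources = ["source.alicloud-ecs.agent_base"]

  # Same steps as cloud-init.sh minus the per-instance parts (.env, pm2 start)
  provisioner "shell" {
    script = "cloud-init-base.sh"   # hypothetical split of the bootstrap
  }
}
```

With the baked image in place, the `data "alicloud_images"` block switches its `name_regex` to `^agent-base-` and cloud-init shrinks to cloning the repo and starting pm2.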

Real-world tip: user_data is logged to /var/log/cloud-init-output.log on the instance. When an agent doesn’t come up, that’s where you look first. Add set -euxo pipefail at the top so failures are loud and traceable.

Pattern 2: ACK for production fleets

Once you have three or more agent kinds running side by side, the per-VM operational cost dominates. ACK gives you one cluster, one scheduler, one upgrade path.

The minimal Terraform to get a managed K8s cluster on Aliyun:

resource "alicloud_cs_managed_kubernetes" "agents" {
  name_prefix          = "agents-${terraform.workspace}"
  version              = "1.30.1-aliyun.1"
  cluster_spec         = "ack.pro.small"
  vpc_id               = module.vpc.vpc_id
  worker_vswitch_ids   = module.vpc.private_vswitch_ids
  pod_vswitch_ids      = module.vpc.private_vswitch_ids
  service_cidr         = "172.21.0.0/20"
  proxy_mode           = "ipvs"
  load_balancer_spec   = "slb.s2.small"
  enable_ssh           = false
  delete_protection    = true
  control_plane_log_components = ["apiserver", "audit"]

  addons {
    name = "managed-arms-prometheus"
  }
  addons {
    name = "logtail-ds"
  }
}

resource "alicloud_cs_kubernetes_node_pool" "agents" {
  cluster_id           = alicloud_cs_managed_kubernetes.agents.id
  node_pool_name       = "agent-workers"
  vswitch_ids          = module.vpc.private_vswitch_ids
  instance_types       = ["ecs.c7.xlarge"]
  # no desired_size here: it conflicts with scaling_config; the autoscaler owns pool size
  system_disk_category = "cloud_essd"
  system_disk_size     = 80
  install_cloud_monitor = true
  scaling_config {
    enable    = true
    min_size  = 2
    max_size  = 10
  }
}

A few notes:

  • ack.pro.small is the managed control plane SKU. Aliyun runs the masters; you only pay for the worker ECS. Don’t pick the unmanaged SKU unless you have a strong reason.
  • pod_vswitch_ids is for Terway, the Aliyun-native CNI. Each pod gets a real VPC IP — no overlay network, security groups apply directly. This is the right default; the alternative (Flannel) makes networking debugging miserable.
  • delete_protection = true does what it says — terraform destroy won’t kill the cluster. Set this on every prod cluster.
  • The addons block enables ARMS Prometheus (article 7) and the SLS log collector. Provisioning these via Terraform means new clusters come pre-instrumented.

The actual agent pods come from a Kubernetes deployment manifest — usually applied by a separate kubectl step or via the kubernetes Terraform provider. I keep the cluster in this terraform project and the workloads in a separate Helm chart, because they have different release cadences.
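If you ever do want a single apply to cover both, the `helm` provider can drive the chart from the same project. A sketch, where the kubeconfig path, chart path, and value names are all hypothetical:

```hcl
provider "helm" {
  kubernetes {
    config_path = "~/.kube/agents-prod"   # kubeconfig exported after cluster creation
  }
}

resource "helm_release" "research_agent" {
  name             = "research-agent"
  namespace        = "agents"
  create_namespace = true
  chart            = "./charts/agent"     # hypothetical local chart

  set {
    name  = "image.tag"
    value = var.agent_image_tag
  }
}
```

In practice I still prefer the split: a chart bump should not require a plan against the cluster resources, and a cluster upgrade should not touch workloads.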

Pattern 3: Function Compute for event-driven agents

Some agents only run when triggered — a webhook fires, a cron tick happens, an OSS object lands. For those, FC is unbeatable: zero idle cost, automatic scale-out, and the cloud handles the runtime entirely.

resource "alicloud_fc_service" "agent" {
  name        = "agent-${terraform.workspace}"
  description = "Event-triggered agent functions"

  log_config {
    project  = alicloud_log_project.agents.name
    logstore = alicloud_log_store.agent_runs.name
  }

  vpc_config {
    vswitch_ids       = module.vpc.private_vswitch_ids
    security_group_id = module.vpc.agent_runtime_sg_id
  }

  role          = alicloud_ram_role.fc_agent.arn
  internet_access = false
}

resource "alicloud_fc_function" "scheduled_research" {
  service     = alicloud_fc_service.agent.name
  name        = "scheduled-research"
  description = "Daily research agent run"
  filename    = "${path.module}/dist/scheduled-research.zip"
  handler     = "index.handler"
  runtime     = "python3.11"
  memory_size = 1024
  timeout     = 600

  environment_variables = {
    LLM_GATEWAY_URL = "http://${alicloud_alb_load_balancer.gateway.dns_name}"
  }
}

resource "alicloud_fc_trigger" "daily" {
  service  = alicloud_fc_service.agent.name
  function = alicloud_fc_function.scheduled_research.name
  name     = "daily-9am"
  type     = "timer"
  config = jsonencode({
    cronExpression = "0 0 9 * * *"
    enable         = true
    payload        = "{}"
  })
}

What this gives you: a Python 3.11 function, 1 GiB RAM, 10-minute timeout, attached to the same VPC and security group as the rest of your stack, triggered every day at 9am. Zero servers to maintain. Cost: roughly ¥0.10 per invocation at this size, plus ¥0.0001 per GB-second of execution.

Three caveats I keep tripping on:

  1. Cold start. First invocation after idle takes 200-800ms more than subsequent ones. For a webhook with sub-second SLA, this matters; for a cron task, it doesn’t. Provisioned concurrency exists but defeats the point of FC.
  2. VPC attachment adds another 200-400ms to cold starts because FC has to attach an ENI to your VPC. Worth it if the function needs to reach RDS/OpenSearch; skip the vpc_config block if it only calls public APIs.
  3. 24-hour max runtime. For long agent loops, FC is a bad fit. Either chunk the loop into shorter steps or use ECS.

A real example: hybrid

Most production agent stacks I’ve shipped end up hybrid:

  • ECS for the always-on conversational agent that holds session state
  • ACK for the worker fleet that processes background jobs
  • FC for webhook receivers and daily/hourly cron tasks

Terraform makes this trivial — three modules in the same project, sharing the VPC and security groups from article 3. The skill is knowing which pattern fits which workload, not learning all the resource syntaxes.
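The wiring is plain module composition. Module paths and variable names below are illustrative:

```hcl
module "vpc" {
  source = "./modules/vpc"   # the network layer from article 3
}

# Always-on conversational agent with session state
module "chat_agent_ecs" {
  source      = "./modules/ecs-agent"
  vswitch_ids = module.vpc.private_vswitch_ids
  sg_id       = module.vpc.agent_runtime_sg_id
}

# Background-job worker fleet
module "workers_ack" {
  source      = "./modules/ack-fleet"
  vpc_id      = module.vpc.vpc_id
  vswitch_ids = module.vpc.private_vswitch_ids
}

# Webhook receivers and cron tasks
module "webhooks_fc" {
  source      = "./modules/fc-agents"
  vswitch_ids = module.vpc.private_vswitch_ids
  sg_id       = module.vpc.agent_runtime_sg_id
}
```

One shared network module, three compute modules, one state file; the security groups from article 3 are the only coupling between them.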

Right-sizing the instance

A common question: which ecs.* family for an agent runtime? My defaults:

Workload                         Family                 Why
Conversational agent, no GPU     ecs.c7                 CPU-bound on tokenisation + I/O on LLM calls
Memory-heavy (large context)     ecs.r7                 More RAM per vCPU
Batch / scheduled with bursts    ecs.c7a (AMD)          ~15% cheaper, slightly slower per core
GPU inference of small models    ecs.gn7i               T4-class, cheapest GPU on Aliyun
Pretraining / large fine-tune    PAI-DLC, not ECS       Don't reinvent the orchestration

Avoid the burstable ecs.t6 family for agent runtime — CPU credits run out under sustained load and your latency goes off a cliff. They’re fine for the bastion that runs terraform apply and not much else.

Real-world tip: Use data.alicloud_instance_types to ask the API for “give me a 4-vCPU 16-GiB instance available in this zone right now”. Hardcoding ecs.c7.xlarge works until that exact SKU is out of stock in your zone, at which point Terraform fails. Letting the data source pick gives you graceful fallback.
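The fallback version of that query drops the family filter and lets the API sort; `sorted_by` accepts "Price" in the provider's data source:

```hcl
# Any family that satisfies the shape, cheapest first.
data "alicloud_instance_types" "fallback" {
  cpu_core_count    = 4
  memory_size       = 16
  availability_zone = "cn-shanghai-l"
  sorted_by         = "Price"
}

# instance_types[0] now degrades gracefully across families when one SKU
# is sold out in the zone:
#   instance_type = data.alicloud_instance_types.fallback.instance_types[0].id
```

The trade-off is determinism: a plan run next month may pick a different family, so pin the family in prod and use the open query in dev workspaces.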

What’s next

Article 5 fills in the storage layer — vector store, relational, object store, backups — that everything we just provisioned needs to talk to. ECS instances are useless until they have somewhere to put memory.

Then article 6 builds the LLM gateway in front of all the compute, article 7 wires observability and cost alarms, and article 8 stitches everything into one terraform apply.

Liked this piece?

Follow on GitHub for the next one — usually one a week.
