Terraform for AI Agents (5): Storage — Vector, Relational, and Object Memory

An agent has three kinds of memory and they map onto three Aliyun services: PolarDB/RDS for sessions, OpenSearch (vector edition) or pgvector for embeddings, OSS for artifacts. Real Terraform for each, plus the lifecycle and backup rules that keep the bill flat.

Most tutorials gloss over an agent’s memory. ‘Just put the embeddings in Pinecone, the sessions in Postgres, and the screenshots in S3.’ On Aliyun, all three are managed services. Correctly provisioning them with Terraform can mean the difference between a working memory and losing three weeks of conversation history because the disk filled up at 4 AM.

This article covers all three layers, their Terraform configurations, the critical but tedious backup and disaster recovery (DR) setup, the major version upgrade process, and the Saturday outage that changed how I do things.


The three-layer memory model

Figure: An agent’s three kinds of memory map onto three Aliyun services

The mental model:

  • Short-term / session — what the agent did in the current run and the last few runs. Conversation turns, tool calls, intermediate state. Schema-stable, low-latency, transactional. Goes in a relational database.
  • Long-term / semantic — embeddings of documents, prior outputs, recall corpus. Hybrid lexical + vector search. Goes in a vector store.
  • Artifact / blob — generated images, PDFs, screenshots, run snapshots. Often large, write-once-read-rarely. Goes in object storage.

Don’t conflate them. I once watched a team try to put 50 GB of generated PDFs in Postgres because “it has a bytea column”. It cost ten times what OSS would have, query latency went to mush, and backups took hours. Each layer has a service that’s good at exactly its job — pick the right one and the bill stays sane.
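One way to keep the layers from blurring in application code is a single routing table that every write goes through. This is a minimal sketch; the record kinds and the `route` helper are hypothetical names for illustration, not part of any SDK.

```python
from enum import Enum


class Layer(Enum):
    RELATIONAL = "rds"        # sessions, tool-call traces, intermediate state
    VECTOR = "vector-store"   # embeddings used for recall
    OBJECT = "oss"            # large, write-once artifacts


# Hypothetical record kinds; rename to match your agent's data model.
_ROUTES = {
    "conversation_turn": Layer.RELATIONAL,
    "tool_call": Layer.RELATIONAL,
    "embedding": Layer.VECTOR,
    "pdf": Layer.OBJECT,
    "screenshot": Layer.OBJECT,
}


def route(kind: str) -> Layer:
    """Return the storage layer for a record kind, failing loudly on unknown kinds."""
    try:
        return _ROUTES[kind]
    except KeyError:
        raise ValueError(f"no storage layer defined for {kind!r}") from None
```

Failing on unknown kinds is deliberate: a new artifact type should force a decision about where it lives instead of defaulting into Postgres.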

Layer 1: relational, RDS for PostgreSQL

For session state—turn-by-turn conversation, tool call traces, and user identity—you need a robust RDBMS. PostgreSQL is my go-to, but MySQL is fine if your team prefers it. Use PolarDB when you need horizontal scaling.

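resource "random_password" "rds_admin" {
  length  = 32
  special = true
  # Restrict specials to the characters RDS accepts in account passwords
  override_special = "!@#$%^&*()_+-="
}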

resource "alicloud_kms_secret" "rds_admin" {
  secret_name              = "agents-${terraform.workspace}-rds-admin"
  secret_data              = random_password.rds_admin.result
  version_id               = "v1"
  description              = "RDS admin password for agents-${terraform.workspace}"
  encryption_key_id        = module.vpc.kms_keys["secrets"]
  recovery_window_in_days  = 7
  force_delete_without_recovery = false
}

resource "alicloud_db_instance" "memory" {
  engine           = "PostgreSQL"
  engine_version   = "16.0"
  instance_type    = terraform.workspace == "prod" ? "pg.x4.large.2c" : "pg.n2.medium.1c"
  instance_storage = 100
  instance_name    = "agents-memory-${terraform.workspace}"

  vswitch_id          = module.vpc.private_vswitch_ids[0]
  security_ips        = [module.vpc.vpc_cidr_block]
  db_instance_storage_type = "cloud_essd"

  encryption_key = module.vpc.kms_keys["memory"]

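  # Backup schedule. Some provider releases expose these settings only on the
  # separate alicloud_db_backup_policy resource; if plan rejects them here,
  # move them there.
  backup_period   = ["Monday", "Wednesday", "Friday"]
  backup_time     = "02:00Z-03:00Z"
  retention_period = terraform.workspace == "prod" ? 30 : 7
  log_backup_retention_period = 30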

  deletion_protection = terraform.workspace == "prod"

  zone_id         = "cn-shanghai-l"
  zone_id_slave_a = terraform.workspace == "prod" ? "cn-shanghai-m" : null

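  lifecycle {
    # lifecycle arguments must be literal values; Terraform rejects expressions
    # such as terraform.workspace == "prod" here, so this guard applies in
    # every workspace
    prevent_destroy = true
  }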
}

resource "alicloud_db_account" "agent" {
  db_instance_id   = alicloud_db_instance.memory.id
  account_name     = "agent"
  account_password = random_password.rds_admin.result
  account_type     = "Super"
}

resource "alicloud_db_database" "session" {
  instance_id   = alicloud_db_instance.memory.id
  name          = "agent_sessions"
  character_set = "UTF8"
}

What earns its line in this block:

  • Password lives in KMS Secrets Manager from birth. Generated by random_password, written to alicloud_kms_secret, retrieved by the agent at startup via STS. The password never appears in HCL, variable files, or CI logs. It does still land in tfstate (random_password.result, secret_data, and account_password are all stored there, marked sensitive), so keep the state bucket encrypted and tightly access-controlled.
  • encryption_key ties the disk to the memory CMK. At-rest encryption, no extra cost.
  • backup_period + retention_period create automated backups three times a week, kept 30 days in prod, 7 in dev. RDS backups are stored on OSS; you don’t manage the bucket.
  • zone_id_slave_a in prod creates a hot standby in a second zone. Failover is sub-30s. The cost is 2× — worth it for prod, overkill for dev.
  • deletion_protection plus lifecycle.prevent_destroy in prod block both terraform destroy and provider-driven force-replaces. I’ll explain why both are needed in the incident section below. The short version is that not having them once cost me a Saturday.

Tip. PolarDB is the right move once your sessions table crosses ~10M rows or you need read replicas without downtime. Migration from RDS to PolarDB is well-documented and Terraform handles both. Don’t start there — RDS is simpler and cheaper at small scale.

A small relational layer is the backbone. The next layer makes the agent feel intelligent.

Layer 2: vector store

You have two reasonable choices on Aliyun for the vector layer:

Figure: Vector embedding space with similarity search for agent memory retrieval

  1. OpenSearch Vector Search Edition: managed, built on Alibaba’s in-house Havenask/Proxima engine rather than Lucene, supports HNSW + IVF, billed per QPS quota.
  2. PolarDB or RDS PostgreSQL with pgvector: co-located with your relational data, no new infra, slower past ~1M vectors.

For anything past prototype I prefer OpenSearch. The cost is real (~¥800/mo for the smallest instance), but you get hybrid lexical + vector search in one query. That matters because pure vector similarity often loses to BM25 on real queries, especially ones that hinge on exact terms such as IDs, error codes, or product names.

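# The provider registry lists this resource type as alicloud_open_search_app_group,
# with type "standard" or "enhanced". Check the docs for your provider version
# before applying, and adjust the type name and any references if plan rejects it.
resource "alicloud_opensearch_app_group" "vector" {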
  app_group_name  = "agent-vec-${terraform.workspace}"
  payment_type    = "PayAsYouGo"
  type            = "vector"
  quota {
    doc_size         = 100
    compute_resource = 20
    spec             = "opensearch.share.junior"
  }
  description = "Long-term semantic memory for agents"
}

The app group is the OpenSearch concept that holds an index. From here you create the index schema via the OpenSearch console or SDK. The alicloud_opensearch_app resource exists, but the schema is operational, not provisionable. Pin the embedding dimension (1536 for text-embedding-3-small, 1024 for BAAI’s bge-m3) in the index settings and never change it; reindexing 10M vectors is a multi-day job.
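Since the dimension is fixed for the life of the index, it is worth rejecting mismatched vectors in the agent before they ever reach the store. A minimal sketch, assuming the 1536-dimension case; `EMBEDDING_DIM` and `check_embedding` are illustrative names, not part of the OpenSearch SDK.

```python
import math
from typing import Sequence

EMBEDDING_DIM = 1536  # assumption: must equal the dimension pinned in the index settings


def check_embedding(vec: Sequence[float], dim: int = EMBEDDING_DIM) -> list[float]:
    """Validate an embedding before it is written to the vector store."""
    if len(vec) != dim:
        raise ValueError(f"expected {dim} dimensions, got {len(vec)}")
    out = [float(x) for x in vec]
    if not all(math.isfinite(x) for x in out):
        raise ValueError("embedding contains NaN or infinite values")
    return out
```

A model swap (say, from a 1536-dimension model to a 1024-dimension one) then fails at the first write instead of silently corrupting recall.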

If you go the pgvector route instead, add this to the RDS database creation:

resource "alicloud_db_database" "vectors" {
  instance_id   = alicloud_db_instance.memory.id
  name          = "agent_vectors"
  character_set = "UTF8"
}

# pgvector extension is created via your migration tool, not Terraform:
# CREATE EXTENSION IF NOT EXISTS vector;
# CREATE TABLE embeddings (id bigserial primary key, vec vector(1536), meta jsonb);
# CREATE INDEX embeddings_vec_idx ON embeddings USING hnsw (vec vector_cosine_ops);

The Terraform half is just the database; the schema is application code (Alembic, Flyway, sqlx-migrate — pick one). Don’t try to manage table schemas in Terraform; that path leads to madness, and I have the scars to prove it.
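For intuition about what the `vector_cosine_ops` index in the comment above is ordering by: pgvector's cosine operator returns cosine distance, which is 1 minus cosine similarity. The sketch below is a brute-force reference in plain Python, useful for sanity-checking results on a small sample. It is not how the HNSW index works internally, and the function names are mine.

```python
import math
from typing import Iterable, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity: 0 for same direction, 1 for orthogonal, 2 for opposite."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def nearest(query: Sequence[float],
            rows: Iterable[Tuple[str, Sequence[float]]],
            k: int = 3) -> list:
    """Exact top-k by cosine distance; rows is an iterable of (id, vector) pairs."""
    return sorted(rows, key=lambda row: cosine_distance(query, row[1]))[:k]
```

Comparing the exact top-k against what the index returns on a few hundred vectors is a cheap recall check before you trust an approximate index in production.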

Layer 3: object storage

OSS is for artifacts: generated images, PDFs, screenshots, run-trace tarballs, and model checkpoints if you fine-tune. For an agent stack:

Figure: Data lifecycle management from hot to cold storage tiers

resource "alicloud_oss_bucket" "artifacts" {
  bucket = "agents-artifacts-${terraform.workspace}-${random_id.suffix.hex}"
  acl    = "private"

  versioning {
    status = "Enabled"
  }

  server_side_encryption_rule {
    sse_algorithm     = "KMS"
    kms_master_key_id = module.vpc.kms_keys["memory"]
  }

  lifecycle_rule {
    id      = "agent-artifacts-tiering"
    enabled = true

    transitions {
      days          = 30
      storage_class = "IA"
    }
    transitions {
      days          = 90
      storage_class = "Archive"
    }
    transitions {
      days          = 365
      storage_class = "ColdArchive"
    }
    expiration {
      days = 730
    }
    noncurrent_version_expiration {
      days = 180
    }
  }

  logging {
    target_bucket = alicloud_oss_bucket.access_logs.id
    target_prefix = "artifacts-access/"
  }

  tags = {
    Domain = "agent-artifacts"
  }
}

resource "random_id" "suffix" {
  byte_length = 4
}

Three things worth a closer look.

Bucket-name uniqueness

OSS bucket names are globally unique across all Aliyun customers, same as S3. The random_id suffix avoids the “name already taken” failure at apply time that bites every first-time user. Once the bucket is created, the name is stable.
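If a script outside Terraform ever needs to mint a name the same way, the pattern is easy to mirror. A minimal sketch: `secrets.token_hex(4)` yields 8 hex characters, the same shape as `random_id` with `byte_length = 4`, and the regex encodes the OSS naming rules as I understand them (3 to 63 characters; lowercase letters, digits, and hyphens; no leading or trailing hyphen). The `bucket_name` helper is hypothetical.

```python
import re
import secrets

# Assumed OSS naming rules: 3-63 chars, lowercase letters, digits, hyphens,
# must start and end with a letter or digit.
_BUCKET_RE = re.compile(r"[a-z0-9][a-z0-9-]{1,61}[a-z0-9]")


def bucket_name(workspace: str, prefix: str = "agents-artifacts") -> str:
    """Build '<prefix>-<workspace>-<8 hex chars>' and validate it."""
    suffix = secrets.token_hex(4)  # 4 random bytes -> 8 hex characters
    name = f"{prefix}-{workspace}-{suffix}"
    if not _BUCKET_RE.fullmatch(name):
        raise ValueError(f"invalid OSS bucket name: {name!r}")
    return name
```

Validating locally turns a slow, confusing API rejection into an immediate error with the offending name in it.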

Lifecycle tiering

The lifecycle_rule block is the single biggest cost lever in OSS:

Figure: OSS lifecycle for agent artifacts

  • Standard (0–30 days, ~¥0.12/GB/mo) — what you write to by default.
  • Infrequent Access (30–90 days, ~¥0.08/GB/mo) — cheaper storage, ~¥0.0125/GB retrieval.
  • Archive (90–365 days, ~¥0.033/GB/mo) — minutes-to-hours retrieval.
  • Cold Archive (365+ days, ~¥0.015/GB/mo) — hours retrieval, the cheapest tier.

For agent artifacts, keep 30 days in Standard, two months in Infrequent Access, nine months in Archive, and one year in Cold Archive, then delete. At the prices above, a 1 TB corpus kept entirely in Standard costs about ¥123/mo (1,024 GB × ¥0.12), roughly ¥1,500 a year. With tiering, most of the corpus sits in Archive or Cold Archive and the same data costs roughly ¥250–400 a year. Codifying the rule in HCL means the saving happens without anyone remembering to move data. The main pitfalls are the 30-day minimum storage charge for IA and the Archive retrieval latency, so avoid putting hot data in cold tiers.
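The arithmetic behind a tiering estimate is worth having as a function you can rerun when prices or the rule change. This sketch uses the approximate per-GB prices listed above and assumes a steady state where artifacts are written at a uniform rate over the 730-day retention window; it ignores retrieval fees, request fees, and minimum-storage-duration charges.

```python
# Approximate storage prices from the list above, in ¥ per GB per month.
PRICES = {"Standard": 0.12, "IA": 0.08, "Archive": 0.033, "ColdArchive": 0.015}

# (tier, first_day, last_day) taken from the lifecycle rule: 30 / 90 / 365 / 730.
TIERS = [
    ("Standard", 0, 30),
    ("IA", 30, 90),
    ("Archive", 90, 365),
    ("ColdArchive", 365, 730),
]


def blended_price_per_gb(tiers=TIERS, prices=PRICES) -> float:
    """Average ¥/GB/month, weighting each tier by the days an object spends in it."""
    total_days = tiers[-1][2] - tiers[0][1]
    weighted = sum(prices[tier] * (end - start) for tier, start, end in tiers)
    return weighted / total_days


def monthly_cost(size_gb: float, price_per_gb: float) -> float:
    return size_gb * price_per_gb


all_standard = monthly_cost(1024, PRICES["Standard"])   # about ¥123/mo
tiered = monthly_cost(1024, blended_price_per_gb())     # about ¥32/mo
```

Under these assumptions the blended rate works out to about ¥0.031/GB/mo, roughly a quarter of the Standard price. A corpus skewed toward recent data will land closer to the Standard figure.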

Versioning

versioning { status = "Enabled" } keeps every object version. An agent that overwrites artifacts/run-123/output.pdf doesn’t actually destroy the previous version — it’s still there with a different version ID. Two reasons this matters:

  1. Recovery. A bug overwrote 50,000 objects with garbage? Restore the previous versions in a script.
  2. Tamper-evidence. Combined with WORM (Write-Once-Read-Many) retention policies, this covers most of what regulatory retention rules ask for without extra tooling.

Versioned objects accumulate, so the noncurrent_version_expiration in the lifecycle rule above prunes old versions after 180 days. Without it, every overwrite adds to the storage line and nothing ever removes it.
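The recovery case boils down to one decision per key: which version was the last good one? Here is a minimal sketch of that selection step, assuming you have already listed versions from OSS and know roughly when the bad writes started. The `ObjectVersion` dataclass is a hypothetical stand-in for whatever your listing call returns.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional


@dataclass(frozen=True)
class ObjectVersion:
    """Hypothetical shape of one entry from a version listing."""
    key: str
    version_id: str
    last_modified: datetime
    is_delete_marker: bool = False


def version_to_restore(versions: Iterable[ObjectVersion],
                       key: str,
                       bad_since: datetime) -> Optional[ObjectVersion]:
    """Newest real version of `key` written before `bad_since`, or None if there is none."""
    candidates = [
        v for v in versions
        if v.key == key and not v.is_delete_marker and v.last_modified < bad_since
    ]
    return max(candidates, key=lambda v: v.last_modified, default=None)
```

A restore script then copies the chosen version ID back over the current object for each affected key. Returning None instead of guessing keeps objects that were first created by the buggy run out of the restore set.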

Backups, DR, and proving they work

A Terraform-managed backup setup looks like this:

Figure: Backups: not optional, just budgeted

  • RDS — built-in automated backups (already in the HCL above).
  • OSS — versioning + cross-region replication for disaster recovery.
  • OpenSearch — snapshot to OSS via the alicloud_opensearch_* snapshot resources.

Cross-region replication for OSS takes a destination bucket plus one replication resource:

resource "alicloud_oss_bucket" "artifacts_dr" {
  provider = alicloud.beijing       # second-region provider alias
  bucket   = "${alicloud_oss_bucket.artifacts.bucket}-dr"
  acl      = "private"

  versioning {
    status = "Enabled"
  }

  server_side_encryption_rule {
    sse_algorithm = "AES256"        # KMS keys are region-scoped; AES256 keeps the DR bucket simple
  }
}

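resource "alicloud_oss_bucket_replication" "artifacts" {
  bucket = alicloud_oss_bucket.artifacts.id

  action = "ALL"
  destination {
    bucket   = alicloud_oss_bucket.artifacts_dr.bucket
    location = "oss-cn-beijing"
  }
  # The provider's argument is historical_object_replication ("enabled" / "disabled")
  historical_object_replication = "enabled"

  # The source bucket uses SSE-KMS, so replicas are re-encrypted with a key in
  # the destination region. Replicating KMS-encrypted objects also needs a
  # source_selection_criteria block and a sync_role, omitted here.
  encryption_configuration {
    replica_kms_key_id = "alias/agents-prod-memory-dr"
  }
}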

The aliased provider lets one Terraform run touch two regions:

provider "alicloud" {
  alias  = "beijing"
  region = "cn-beijing"
}

For a research agent that’s mainly stateless, you might decide DR isn’t worth the storage doubling. For a customer-facing one with conversation history that legally must persist, it’s mandatory.

Prove the replica works: monthly drill

Replication you have never restored from is a backup that doesn’t exist. The script that catches a broken replica 30 days early instead of in the actual disaster:

#!/bin/bash
# scripts/dr-drill.sh: run on the 1st of every month from a CI cron
set -euo pipefail

# Fail before doing any work if the notification target is missing
: "${DINGTALK_WEBHOOK:?DINGTALK_WEBHOOK must be set}"

PRIMARY_BUCKET="agents-artifacts-prod-abc12345"
REPLICA_BUCKET="${PRIMARY_BUCKET}-dr"
DRILL_KEY="dr-drill/$(date -u +%Y%m%dT%H%M%SZ).probe"
WORKDIR=$(mktemp -d)
trap 'rm -rf "$WORKDIR"' EXIT

# 1. Write a probe to primary (uploaded from a temp file with cp)
echo "drill $(uuidgen)" > "$WORKDIR/probe"
aliyun oss cp "$WORKDIR/probe" "oss://$PRIMARY_BUCKET/$DRILL_KEY"

# 2. Wait for replication (typically <60s within China), give up after 150s
replicated=false
for i in {1..30}; do
  if aliyun oss --region cn-beijing stat "oss://$REPLICA_BUCKET/$DRILL_KEY" > /dev/null 2>&1; then
    echo "Replicated in $(( (i - 1) * 5 ))s"
    replicated=true
    break
  fi
  sleep 5
done
[[ "$replicated" == true ]] || { echo "REPLICATION TIMEOUT"; exit 1; }

# 3. Verify content match: download both copies and compare hashes
aliyun oss cp "oss://$PRIMARY_BUCKET/$DRILL_KEY" "$WORKDIR/primary"
aliyun oss --region cn-beijing cp "oss://$REPLICA_BUCKET/$DRILL_KEY" "$WORKDIR/replica"
PRIMARY_HASH=$(sha256sum "$WORKDIR/primary" | awk '{print $1}')
REPLICA_HASH=$(sha256sum "$WORKDIR/replica" | awk '{print $1}')
[[ "$PRIMARY_HASH" == "$REPLICA_HASH" ]] || { echo "HASH MISMATCH"; exit 1; }

# 4. Clean up the probe (the primary delete may already have replicated)
aliyun oss rm "oss://$PRIMARY_BUCKET/$DRILL_KEY"
aliyun oss --region cn-beijing rm "oss://$REPLICA_BUCKET/$DRILL_KEY" || true

# 5. Notify DingTalk on success
curl -fsS -X POST "$DINGTALK_WEBHOOK" \
  -H 'Content-Type: application/json' \
  -d '{"msgtype":"text","text":{"content":"DR drill OK at '"$(date -Iseconds)"'"}}'

Two minutes of compute, zero human attention, and the replica health is no longer a question of faith. Wire it into the same GitHub Actions cron as the drift check from article 2. The same pattern — periodic automated drills of the things you’d otherwise discover are broken at the worst possible time — applies to RDS restore, KMS key rotation, and every single failover path.
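Every drill of this kind has the same core: poll until the thing you expect shows up, and treat "never showed up" as a hard failure, not something to fall through. A minimal sketch of that loop in Python, with the check injected so it can wrap an OSS stat, an RDS restore status, or anything else. The `wait_until` name and signature are mine.

```python
import time
from typing import Callable, Optional


def wait_until(check: Callable[[], bool],
               attempts: int = 30,
               delay_s: float = 5.0,
               sleep: Callable[[float], None] = time.sleep) -> Optional[float]:
    """Poll check() up to `attempts` times.

    Returns the seconds spent waiting before check() succeeded,
    or None if it never did (the caller should fail the drill).
    """
    for i in range(attempts):
        if check():
            return i * delay_s
        sleep(delay_s)
    return None
```

Usage is `waited = wait_until(lambda: replica_has(key))` followed by an explicit failure when `waited is None`; `replica_has` is whatever existence check your drill uses. Injecting `sleep` keeps the loop testable without real delays.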

Tip. I run a separate restore-drill.sh monthly that pulls a random RDS backup into a cn-shanghai-dr instance and runs schema/checksum verification. It’s the most useful 30 minutes I spend each month.

RDS major-version upgrades through Terraform

PostgreSQL ships a major release every year and supports each one for about five years, so the version you run will eventually go EOL. RDS for PostgreSQL upgrades are a real operational event: they have downtime, they can fail, and the Terraform provider exposes them through engine_version changes that look innocent and aren’t.

The flow that has worked for me on a v15 → v16 upgrade:

Step 1 — snapshot before touching anything.

aliyun rds CreateBackup \
  --DBInstanceId pgm-uf6abc123 \
  --BackupMethod Physical \
  --BackupType FullBackup

Wait for completion via aliyun rds DescribeBackups. This is your “oh god” button.

Step 2 — clone to a sibling instance for the trial.

resource "alicloud_db_instance" "memory_v16_trial" {
  engine           = "PostgreSQL"
  engine_version   = "16.0"
  instance_type    = alicloud_db_instance.memory.instance_type
  instance_storage = alicloud_db_instance.memory.instance_storage
  vswitch_id       = module.vpc.private_vswitch_ids[0]

  source_db_instance_name = alicloud_db_instance.memory.id
  backup_id               = data.alicloud_db_backups.latest.ids[0]

  instance_name = "memory-v16-trial"
}

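terraform apply brings up a v16 instance with v15 data restored. QA points staging traffic at it for a week. Zero production risk. If your provider release refuses to restore a v15 backup into a v16 instance, the dedicated alicloud_rds_upgrade_db_instance resource creates the upgraded copy instead.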

Step 3 — when confident, upgrade in-place.

resource "alicloud_db_instance" "memory" {
  engine_version = "16.0"   # was "15.0"
  # everything else unchanged
}

terraform plan shows ~ engine_version: "15.0" -> "16.0" and the apply triggers an in-place upgrade. Downtime depends on database size — sub-1-minute for a small DB, up to 30 minutes for a multi-TB one. The upgrade is reversible only by restoring from the snapshot in step 1, so do not skip step 1.

Step 4 — tear down the trial.

terraform state rm alicloud_db_instance.memory_v16_trial
# remove the HCL block
# next plan: 0 changes

Then delete the trial in the console. Don’t reach for terraform destroy in this project: after state rm it no longer knows about the trial, and a bare destroy would tear down every other resource in the workspace.

The whole process spans ~2 weeks calendar time, ~3 hours of focused work, zero unplanned downtime. The win of doing it through Terraform is that the trial-instance HCL stays in git — six months later when you do the v17 upgrade, the playbook is already there, in the same repo, reviewed by the same team.

Tip. Test the connection-string change in the agent code before the upgrade. Newer Postgres defaults (e.g. password_encryption = scram-sha-256 since v14) break old client libraries that only speak md5. Run your agent against the trial instance for a full day before promoting.

A real incident: the night I terraform apply-ed the wrong workspace

This one cost me a Saturday. Worth telling because the fix is structural, not “be more careful”.

Setup: three workspaces — dev, staging, prod. The dev RDS was a small pg.n2.medium.1c. Prod was pg.x4.large.2c with HA. I was working from a laptop, switched a feature branch, ran terraform plan to check it. Saw “Plan: 2 to add, 1 to change, 0 to destroy” — looked clean. Ran terraform apply. Walked away to make coffee.

Came back to a destroyed prod database.

Root cause: I had selected prod workspace in a previous session and never switched back. The change I was applying (a tag tweak) was harmless in dev. In prod, what looked like “1 to change” was actually a force-replace because of an unrelated provider bump that had landed in main and required RDS recreation. The provider plan output should have been clearer — it wasn’t.

The HCL that bit me:

# innocent-looking change in a PR
resource "alicloud_db_instance" "memory" {
  # ... unchanged ...
  parameters {
    name  = "log_min_duration_statement"
    value = "100"   # was "1000", harmless tuning
  }
}

The provider, in its newly bumped 1.231 version, had decided this parameter was now force_new for some engine versions. The plan said ~ parameters, which looks like an in-place change; the apply did a recreate.

Restore took 90 minutes from automated backup (mercifully recent). The post-mortem produced four structural fixes that have prevented every recurrence in two years.

Fix 1: lifecycle { prevent_destroy = true } on prod stateful resources

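Already in the RDS block earlier in this article. Any terraform apply whose plan would destroy a prod RDS now errors out with Resource has lifecycle.prevent_destroy set. To actually destroy, you have to remove the line in HCL, file a PR, and get approval. When the plan correctly reports a replacement, this single line stops it cold. It only acts on what the plan reports, though, which is why the RDS block also sets deletion_protection: that flag is enforced by the RDS API itself and still refuses the delete when the provider recreates a resource the plan showed as an in-place change.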

Fix 2: workspace prompt in the shell

A function that yells on every terraform invocation:

# Works in both bash and zsh (bash-style `read -p` means something else in zsh)
terraform() {
  local ws reply
  # `command` bypasses this function and runs the real binary from PATH
  ws=$(command terraform workspace show 2>/dev/null)
  if [[ "$ws" == "prod" ]]; then
    printf '\033[1;31m================================\033[0m\n'
    printf '\033[1;31m WARNING: workspace = prod\033[0m\n'
    printf '\033[1;31m================================\033[0m\n'
    printf 'Continue? [y/N] '
    read -r reply
    [[ "$reply" == [Yy] ]] || return 1
  fi
  command terraform "$@"
}

The 1-second pause is enough to break autopilot. Lives in my .zshrc. I haven’t accidentally hit prod since.

Fix 3: prod apply only from CI, never from a laptop

The cleaner version: revoke your laptop’s permission to apply against the prod state file. Only the GitHub Actions runner has the RAM role with oss:PutObject on the prod state prefix. Local terraform plan works (it only reads); local apply fails with AccessDenied.

# Attached to the developer role
{
  "Effect": "Deny",
  "Action": "oss:PutObject",
  "Resource": "acs:oss:*:*:ck-tfstate-prod/agents/env:prod/*"
}

The Deny on the developer role wins over any Allow. Devs can plan; only CI can apply. The CI run is gated by PR review.
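The "explicit Deny beats any Allow" rule is easy to sanity-check with a toy evaluator before you attach the policy. This is a deliberately simplified model, not RAM's real engine: it ignores conditions, principals, and everything except Effect, Action, and Resource, and it uses shell-style wildcards from `fnmatch` as an approximation of RAM's `*` matching. The account ID and object key in the usage note are made up.

```python
from fnmatch import fnmatchcase
from typing import Iterable, Mapping


def _as_list(value):
    """Policy fields may be a single string or a list of strings."""
    return value if isinstance(value, list) else [value]


def is_allowed(statements: Iterable[Mapping], action: str, resource: str) -> bool:
    """Simplified evaluation: any matching Deny wins; otherwise need a matching Allow."""
    allowed = False
    for st in statements:
        matches = (
            any(fnmatchcase(action, a) for a in _as_list(st["Action"]))
            and any(fnmatchcase(resource, r) for r in _as_list(st["Resource"]))
        )
        if not matches:
            continue
        if st["Effect"] == "Deny":
            return False  # explicit deny short-circuits everything else
        allowed = True
    return allowed


STATE_PREFIX = "acs:oss:*:*:ck-tfstate-prod/agents/env:prod/*"
developer = [
    {"Effect": "Allow", "Action": "oss:*", "Resource": "*"},
    {"Effect": "Deny", "Action": "oss:PutObject", "Resource": STATE_PREFIX},
]
```

With a state object such as `acs:oss:cn-shanghai:1234567890:ck-tfstate-prod/agents/env:prod/terraform.tfstate`, the developer policy allows `oss:GetObject` (plan works) and refuses `oss:PutObject` (apply cannot write state), which is the split the fix relies on.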

Fix 4: a pre-apply hook that summarises destruction

#!/bin/bash
# scripts/pre-apply.sh PLAN_FILE: run between `terraform plan -out=PLAN_FILE`
# and `terraform apply PLAN_FILE` (for example as a CI step)
set -euo pipefail

plan_file=$1
# Replacements appear as ["delete","create"], so they are caught too
deleted=$(terraform show -json "$plan_file" \
  | jq -r '.resource_changes // [] | .[] | select(.change.actions | index("delete")) | .address')
if [[ -n "$deleted" ]]; then
  echo "Plan would destroy $(wc -l <<< "$deleted" | tr -d ' ') resources:"
  echo "$deleted"
  echo "Confirm with DESTROY=yes terraform apply"
  if [[ "${DESTROY:-}" != "yes" ]]; then
    exit 1
  fi
fi

Any plan that deletes something requires DESTROY=yes in the env. You cannot type that by accident. This is the belt to the suspenders of prevent_destroy — it catches the case where the destruction is in a child module you forgot to lock down.
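If your CI already runs Python, the same gate is a few lines over the JSON that `terraform show -json` emits. The sketch below relies only on the documented `resource_changes[].change.actions` field and separates outright destroys from replacements, since a replace is the case that looks harmless in a skimmed plan. The function name is mine.

```python
from typing import List, Tuple


def destructive_changes(plan: dict) -> List[Tuple[str, str]]:
    """Return (address, kind) for every change that deletes a resource.

    kind is "replace" when the resource is deleted and recreated,
    "destroy" when it is only deleted.
    """
    found = []
    for rc in plan.get("resource_changes") or []:
        actions = rc["change"]["actions"]
        if "delete" in actions:
            kind = "replace" if "create" in actions else "destroy"
            found.append((rc["address"], kind))
    return found
```

Feed it `json.load(sys.stdin)` from `terraform show -json plan.tfplan` and fail the job when the list is non-empty and no explicit override is set. Printing the kind next to each address makes a replace of a database hard to miss.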

Take all four. None alone would have saved me; together they make this failure mode very hard to reach by accident.

Connecting compute to storage

The ECS instance from article 4 needs to actually reach this storage. Three pieces:

  1. Network — already done. The agent_runtime_sg_id from the VPC module is the source for the memory_rds_sg and vector_store_sg ingress rules.
  2. Credentials — the agent reads the DB password from KMS Secrets Manager via STS:
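    from alibabacloud_credentials.client import Client as CredClient
    from alibabacloud_kms20160120.client import Client as KmsClient
    from alibabacloud_kms20160120.models import GetSecretValueRequest
    from alibabacloud_tea_openapi.models import Config
    kms_client = KmsClient(Config(credential=CredClient(), endpoint="kms.cn-shanghai.aliyuncs.com"))  # default chain picks up the ECS RAM role (STS); region is an assumption
    db_password = kms_client.get_secret_value(GetSecretValueRequest(secret_name="agents-prod-rds-admin")).body.secret_data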
    
  3. Endpoints — Terraform outputs them:
    output "rds_endpoint" {
      value = alicloud_db_instance.memory.connection_string
    }
    output "vector_endpoint" {
      value = alicloud_opensearch_app_group.vector.api_domain
    }
    output "artifacts_bucket" {
      value = alicloud_oss_bucket.artifacts.bucket
    }
    

The agent reads these from environment variables that cloud-init sets from the Terraform outputs. No hardcoded endpoints, no manual config files, no human in the loop on rotation.
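On the agent side, it pays to load all three endpoints in one place and refuse to start if any is missing. A minimal sketch; the environment variable names are hypothetical, so use whatever your cloud-init template actually exports.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class StorageConfig:
    rds_endpoint: str
    vector_endpoint: str
    artifacts_bucket: str


# Hypothetical variable names; match them to what cloud-init sets.
_ENV_NAMES = {
    "rds_endpoint": "AGENT_RDS_ENDPOINT",
    "vector_endpoint": "AGENT_VECTOR_ENDPOINT",
    "artifacts_bucket": "AGENT_ARTIFACTS_BUCKET",
}


def load_storage_config(env: Mapping[str, str] = os.environ) -> StorageConfig:
    """Read storage endpoints from the environment, failing fast on gaps."""
    missing = sorted(name for name in _ENV_NAMES.values() if not env.get(name))
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return StorageConfig(**{field: env[name] for field, name in _ENV_NAMES.items()})
```

Failing at startup with the full list of missing variables is much easier to debug than a connection error to an empty hostname twenty minutes into a run.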

What it costs

Monthly, dev workspace, low traffic:

  • RDS PostgreSQL (pg.n2.medium.1c, 100 GB ESSD): ~¥350/mo.
  • OpenSearch vector (smallest spec): ~¥800/mo.
  • OSS (10 GB Standard, lifecycle on): ~¥1.5/mo + traffic.
  • KMS (covered in article 3): ~¥10/mo.

Roughly ¥1200/mo for the storage layer in dev. Prod with HA RDS, larger OpenSearch, more OSS, and the cross-region replica will be ¥3000–5000/mo. This is where the cost pressure starts being real — article 7 shows how to track and alert on it before it surprises you in the monthly bill review.

What’s Next

Article 6 builds the LLM gateway in front of the compute we provisioned in article 4 and the storage we just provisioned. That’s the place where API keys live, quotas get enforced, and per-agent cost gets attributed. By the end of article 6 you’ll have a complete agent-runnable stack — the last two articles wire observability and cost control over the top.
