Series · Terraform Agents · Chapter 5

Terraform for AI Agents (5): Storage — Vector, Relational, and Object Memory

An agent has three kinds of memory and they map onto three Aliyun services: PolarDB/RDS for sessions, OpenSearch (vector edition) or pgvector for embeddings, OSS for artifacts. Real Terraform for each, plus the lifecycle and backup rules that keep the bill flat.

An agent’s memory is the part most tutorials hand-wave. “Just put the embeddings in Pinecone, the sessions in Postgres, the screenshots in S3.” On Aliyun, all three exist as managed services, and Terraform-provisioning them right is the difference between “memory works” and “we lost three weeks of conversation history because the disk filled up at 4am”.

This article covers all three layers, the Terraform for each, and the boring-but-critical lifecycle and backup rules.

The three-layer memory model

An agent’s three kinds of memory map onto three Aliyun services

The mental model:

  • Short-term / session — what the agent did in the current run and the last few runs. Conversation turns, tool calls, intermediate state. Schema-stable, low-latency, transactional. Goes in a relational database.
  • Long-term / semantic — embeddings of documents, prior outputs, and recall corpus. Hybrid lexical + vector search. Goes in a vector store.
  • Artifact / blob — generated images, PDFs, screenshots, run snapshots. Sometimes large, often write-once-read-rarely. Goes in object storage.

Don’t conflate them. I have watched a team try to put 50 GB of generated PDFs in Postgres because “it has a bytea column”. It cost ten times what OSS would have, query latency turned to mush, and backups took hours.

Layer 1: relational, RDS for PostgreSQL

For session state — turn-by-turn conversation, tool-call traces, user identity — you want a real RDBMS. PostgreSQL is my default; MySQL works fine if your team prefers it. PolarDB is the next step up when you need horizontal scale.

resource "random_password" "rds_admin" {
  length  = 32
  special = true
}

resource "alicloud_kms_secret" "rds_admin" {
  secret_name              = "agents-${terraform.workspace}-rds-admin"
  secret_data              = random_password.rds_admin.result
  version_id               = "v1"
  description              = "RDS admin password for agents-${terraform.workspace}"
  encryption_key_id        = module.vpc.kms_keys["secrets"]
  recovery_window_in_days  = 7
  force_delete_without_recovery = false
}

resource "alicloud_db_instance" "memory" {
  engine           = "PostgreSQL"
  engine_version   = "16.0"
  instance_type    = terraform.workspace == "prod" ? "pg.x4.large.2c" : "pg.n2.medium.1c"
  instance_storage = 100
  instance_name    = "agents-memory-${terraform.workspace}"

  vswitch_id          = module.vpc.private_vswitch_ids[0]
  security_ips        = [module.vpc.vpc_cidr_block]
  db_instance_storage_type = "cloud_essd"

  encryption_key = module.vpc.kms_keys["memory"]

  deletion_protection = terraform.workspace == "prod"

  zone_id         = "cn-shanghai-l"
  zone_id_slave_a = terraform.workspace == "prod" ? "cn-shanghai-m" : null
}

# Backup settings live on their own resource in the alicloud provider,
# not on alicloud_db_instance itself.
resource "alicloud_db_backup_policy" "memory" {
  instance_id                  = alicloud_db_instance.memory.id
  preferred_backup_period      = ["Monday", "Wednesday", "Friday"]
  preferred_backup_time        = "02:00Z-03:00Z"
  data_backup_retention_period = terraform.workspace == "prod" ? 30 : 7
  log_backup_retention_period  = 30
}

resource "alicloud_db_account" "agent" {
  db_instance_id   = alicloud_db_instance.memory.id
  account_name     = "agent"
  account_password = random_password.rds_admin.result
  account_type     = "Super"
}

resource "alicloud_db_database" "session" {
  instance_id   = alicloud_db_instance.memory.id
  name          = "agent_sessions"
  character_set = "UTF8"
}

Highlights:

  • Password lives in KMS Secrets Manager from birth. Generated by random_password, written to alicloud_kms_secret, retrieved by the agent at startup via STS. It never appears in config files or environment variables — though note that random_password does persist the value in tfstate, so your state backend must be encrypted and access-controlled.
  • encryption_key ties the disk to the memory CMK. At-rest encryption, no extra cost.
  • The backup policy creates automated backups three times a week, kept 30 days in prod, 7 in dev. RDS stores the backup files in service-managed OSS; you don’t manage that bucket.
  • zone_id_slave_a in prod creates a hot standby in a second zone. Failover is sub-30s. The cost is 2× — worth it for prod, overkill for dev.
  • deletion_protection in prod blocks terraform destroy from killing the database. Always.
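The STS retrieval mentioned above needs a matching RAM grant. A minimal sketch, assuming the agent-runtime RAM role from article 4 is exposed as a module output — the module.compute.agent_runtime_role_name reference and the resource names here are assumptions, not something defined earlier in the series:

```hcl
# Allow the agent-runtime role to read exactly one secret, nothing else.
resource "alicloud_ram_policy" "read_rds_secret" {
  policy_name = "agents-${terraform.workspace}-read-rds-secret"
  policy_document = jsonencode({
    Version = "1"
    Statement = [{
      Effect   = "Allow"
      Action   = ["kms:GetSecretValue"]
      Resource = [alicloud_kms_secret.rds_admin.arn]
    }]
  })
}

resource "alicloud_ram_role_policy_attachment" "read_rds_secret" {
  policy_name = alicloud_ram_policy.read_rds_secret.policy_name
  policy_type = "Custom"
  role_name   = module.compute.agent_runtime_role_name # assumed output from article 4
}
```

Scoping Resource to the single secret ARN means a compromised agent can read its own DB password but not every secret in the account.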

Real-world tip: PolarDB is the right choice once your sessions table crosses ~10M rows or you need read replicas without downtime. Migration from RDS to PolarDB is well-documented and Terraform handles both. Don’t start there — RDS is simpler and cheaper at small scale.

Layer 2: vector store

You have two reasonable choices on Aliyun for the vector layer:

  1. OpenSearch Vector Search Edition — managed, Lucene-backed, supports HNSW + IVF, billed per QPS quota
  2. PolarDB or RDS PostgreSQL with pgvector — co-located with your relational data, free in terms of new infra, slower past ~1M vectors

For anything past prototype, I prefer OpenSearch. The cost is real (~¥800/mo for the smallest instance), but you get hybrid lexical+vector search out of the box, which is the right shape for retrieval.

resource "alicloud_opensearch_app_group" "vector" {
  app_group_name  = "agent-vec-${terraform.workspace}"
  payment_type    = "PayAsYouGo"
  type            = "vector"
  quota {
    doc_size   = 100
    compute_resource = 20
    spec       = "opensearch.share.junior"
  }
  description = "Long-term semantic memory for agents"
}

The app group is the OpenSearch concept that holds an index. From there you create the index schema via the OpenSearch console or SDK — an alicloud_opensearch_app resource exists, but the schema is an operational concern, not an infrastructure one, so keep it out of Terraform.

If you go the pgvector route instead, add this to the RDS database creation:

resource "alicloud_db_database" "vectors" {
  instance_id   = alicloud_db_instance.memory.id
  name          = "agent_vectors"
  character_set = "UTF8"
}

# pgvector extension is created via your migration tool, not Terraform:
# CREATE EXTENSION IF NOT EXISTS vector;
# CREATE TABLE embeddings (id bigserial primary key, vec vector(1536), meta jsonb);
# CREATE INDEX embeddings_vec_idx ON embeddings USING hnsw (vec vector_cosine_ops);

The Terraform half is just the database; the schema is application code (Alembic, Flyway, sqlx-migrate — pick one). Don’t try to manage table schemas in Terraform; that path leads to madness.

Layer 3: object storage

OSS is where artifacts go: generated images, PDFs, screenshots, run-trace tarballs, model checkpoints if you fine-tune.

The official “Create a bucket with Terraform” practice doc covers the basics. For an agent stack:

resource "alicloud_oss_bucket" "artifacts" {
  bucket = "agents-artifacts-${terraform.workspace}-${random_id.suffix.hex}"
  acl    = "private"

  versioning {
    status = "Enabled"
  }

  server_side_encryption_rule {
    sse_algorithm   = "KMS"
    kms_master_key_id = module.vpc.kms_keys["memory"]
  }

  lifecycle_rule {
    id      = "agent-artifacts-tiering"
    enabled = true

    transitions {
      days          = 30
      storage_class = "IA"
    }
    transitions {
      days          = 90
      storage_class = "Archive"
    }
    transitions {
      days          = 365
      storage_class = "ColdArchive"
    }
    expiration {
      days = 730
    }
  }

  logging {
    target_bucket = alicloud_oss_bucket.access_logs.id # log bucket defined elsewhere in the stack
    target_prefix = "artifacts-access/"
  }

  tags = {
    Domain = "agent-artifacts"
  }
}

resource "random_id" "suffix" {
  byte_length = 4
}

Three things worth a closer look:

Bucket-name uniqueness

OSS bucket names are globally unique across all Aliyun customers. The random_id suffix avoids the “name already taken” plan failure that bites every first-time user. Once the bucket is created, the name is stable.

Lifecycle tiering

The lifecycle_rule block is the single biggest cost lever in OSS:

OSS lifecycle for agent artifacts

  • Standard (0-30 days, ~¥0.12/GB/mo) — what you write to by default
  • Infrequent Access (30-90 days, ~¥0.08/GB/mo) — cheaper storage, plus a per-GB retrieval fee
  • Archive (90-365 days, ~¥0.033/GB/mo) — minutes-to-hours retrieval
  • Cold Archive (365+ days, ~¥0.015/GB/mo) — hours retrieval, the cheapest

For agent artifacts, this rule says: keep 30 days hot, then move to IA, then Archive at 3 months, then Cold Archive at a year, then delete at two years. For a steady 1 TB artifact corpus, that is the difference between ~¥120/mo (all Standard at ~¥0.12/GB) and roughly ¥30-40/mo once objects have aged through the tiers. Codify it in HCL once and the saving compounds with every TB you accumulate.

Versioning

versioning { status = "Enabled" } keeps every object version. An agent that overwrites artifacts/run-123/output.pdf doesn’t actually destroy the previous version — it’s still there with a different version ID. Two reasons this matters:

  1. Recovery. A bug overwrote 50,000 objects with garbage? Restore the previous versions in a script.
  2. Tamper-evidence. Combined with WORM (Write-Once-Read-Many) policies, this gives you regulatory compliance for free.

The cost is real — versioned objects accumulate. Pair versioning with a noncurrent_version_expiration rule in the lifecycle to prune old versions after, say, 180 days.
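That pruning rule can sit as a second lifecycle_rule inside the artifacts bucket resource. A sketch, assuming your alicloud provider version supports the noncurrent_version_expiration block (added alongside versioning support):

```hcl
  # Inside alicloud_oss_bucket.artifacts: expire superseded versions
  # 180 days after they stop being current. Current versions are
  # untouched; only the version history is trimmed.
  lifecycle_rule {
    id      = "prune-old-versions"
    enabled = true

    noncurrent_version_expiration {
      days = 180
    }
  }
```

With this in place, the recovery window from the bug scenario above is 180 days, after which old versions stop costing you money.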

The backup story

A Terraform-managed backup setup looks like this:

Backups: not optional, just budgeted

  • RDS: built-in automated backups (already in our HCL above)
  • OSS: versioning + cross-region replication for disaster recovery
  • OpenSearch: snapshot to OSS via the alicloud_opensearch_* snapshot resources

Cross-region replication for OSS is one resource:

resource "alicloud_oss_bucket" "artifacts_dr" {
  provider = alicloud.beijing       # second-region provider alias
  bucket   = "${alicloud_oss_bucket.artifacts.bucket}-dr"
  acl      = "private"

  versioning {
    status = "Enabled"
  }

  server_side_encryption_rule {
    sse_algorithm = "AES256"        # KMS keys are region-scoped, simpler to use AES256 for DR
  }
}

resource "alicloud_oss_bucket_replication" "artifacts" {
  bucket = alicloud_oss_bucket.artifacts.id

  action = "ALL"
  destination {
    bucket   = alicloud_oss_bucket.artifacts_dr.bucket
    location = "oss-cn-beijing"
  }
  historical_object_replication = "enabled"

  # No encryption_configuration block: the DR bucket above re-encrypts
  # replicas with AES256, so no replica KMS key is needed.
}

The aliased provider lets one Terraform run touch two regions:

provider "alicloud" {
  alias  = "beijing"
  region = "cn-beijing"
}

For a research agent that’s mainly stateless, you might decide DR isn’t worth the storage doubling. For a customer-facing one with conversation history that legally must persist, it’s mandatory.

Real-world tip: Test the restore on a schedule. A backup you have never restored is just an expensive hope. I run a restore-drill.sh script monthly that pulls a random RDS backup into a cn-shanghai-dr instance and runs schema/checksum verification. It is the most useful 30 minutes I spend each month.

Connecting compute to storage

The ECS instance from article 4 needs to actually reach this storage. Three pieces:

  1. Network — already done. The agent_runtime_sg_id from the VPC module is the source for the memory_rds_sg and vector_store_sg ingress rules.
  2. Credentials — the agent reads the DB password from KMS Secrets Manager via STS:
    from alibabacloud_kms20160120.client import Client as KmsClient
    from alibabacloud_kms20160120.models import GetSecretValueRequest  # models import, easy to forget
    
    resp = kms_client.get_secret_value(GetSecretValueRequest(secret_name="agents-prod-rds-admin"))
    db_password = resp.body.secret_data
    
  3. Endpoints — Terraform outputs them:
    output "rds_endpoint" {
      value = alicloud_db_instance.memory.connection_string
    }
    output "vector_endpoint" {
      value = alicloud_opensearch_app_group.vector.api_domain
    }
    output "artifacts_bucket" {
      value = alicloud_oss_bucket.artifacts.bucket
    }
    

The agent reads these from environment variables that cloud-init sets from the Terraform outputs. No hardcoded endpoints, no manual config files.
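One way to wire that up, sketched with hypothetical names (the RDS_ENDPOINT / ARTIFACTS_BUCKET variable names are whatever your agent process expects, not a convention from earlier articles):

```hcl
# A cloud-init fragment rendered by Terraform; pass it as user_data to the
# ECS instance from article 4. The interpolations resolve during apply, so
# the instance boots with the real endpoints already in its environment.
locals {
  agent_storage_env = <<-EOT
    #cloud-config
    write_files:
      - path: /etc/profile.d/agent-storage.sh
        permissions: "0644"
        content: |
          export RDS_ENDPOINT="${alicloud_db_instance.memory.connection_string}"
          export ARTIFACTS_BUCKET="${alicloud_oss_bucket.artifacts.bucket}"
  EOT
}
```

Because the fragment is derived from resource attributes, replacing the database or bucket automatically propagates the new endpoint on the next instance boot.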

What it costs (monthly, dev workspace, low traffic)

  • RDS PostgreSQL (pg.n2.medium.1c, 100 GB ESSD): ~¥350/mo
  • OpenSearch vector (smallest): ~¥800/mo
  • OSS (10 GB Standard, lifecycle on): ~¥1.5/mo + traffic
  • KMS (covered in article 3): ~¥10/mo

Roughly ¥1200/mo for the storage layer in dev. Prod with HA RDS, larger OpenSearch, more OSS will be ¥3000-5000/mo. This is where the cost pressure starts being real — article 7 shows how to track and alert on it.

What’s next

Article 6 builds the LLM gateway in front of the compute we provisioned in article 4 and the storage we just provisioned. That’s the place where API keys live, quotas get enforced, and per-agent cost gets attributed. By the end of article 6 you’ll have a complete agent-runnable stack — the last two articles wire observability and cost control over the top.

Liked this piece?

Follow on GitHub for the next one — usually one a week.
