Terraform for AI Agents (2): Provider, Auth, and Remote State on OSS
Pinning the alicloud provider, picking between AK/SK, AssumeRole, and ECS RAM role auth, putting tfstate on OSS with Tablestore locking, and the workspace pattern that keeps dev/staging/prod from stomping each other. Plus the eight failure modes that bite first-timers.
This is the article where you stop reading and start typing. By the end you will have:
- The `alicloud` Terraform provider installed and version-pinned
- Authentication wired up — through the right method, not the convenient one
- Remote state on an OSS bucket with Tablestore-based locking
- Three workspaces (`dev`, `staging`, `prod`) that share a backend but isolate state
- A working `terraform plan` against an empty config
Nothing here provisions an agent yet. We’re laying the foundation that every later article assumes.
Step 0: install Terraform
I won’t dwell — the official Install Terraform doc covers all OSes. On macOS it’s `brew tap hashicorp/tap && brew install hashicorp/tap/terraform`.
Pin to a recent stable release. The Aliyun docs are tested against `>= 0.12`, but on a fresh project you should use `>= 1.9`. There are real ergonomic improvements in newer versions (`for_each`, `optional()`, refined `moved` blocks).
Step 1: pin the provider
Create a project directory and a versions.tf:
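A minimal `versions.tf` along these lines, pinning both the Terraform core version and the provider (the constraints match what we discuss next):

```hcl
terraform {
  required_version = ">= 1.9"

  required_providers {
    alicloud = {
      source  = "aliyun/alicloud"
      version = "~> 1.230"
    }
  }
}
```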
The `~> 1.230` constraint allows any release from 1.230.0 up to, but not including, 2.0.0: new minor versions are accepted, major bumps are blocked. (To lock to patch releases only, pin `~> 1.230.0` instead.) This is the right default. Once you commit `.terraform.lock.hcl` to git (Terraform creates it on `terraform init`), you also lock the exact provider version and its checksum. If a teammate runs `terraform init` later, they get the same provider — bit-identical.
Pinning early is cheap insurance. The alicloud provider has shipped breaking changes between minor versions (last big one was the OSS bucket schema rework around 1.220). You will eventually need to upgrade — do it deliberately, in a PR, with the diff in plan output, not by accident on a teammate’s laptop.
Step 2: authenticate — three options, ranked
The provider needs Aliyun credentials. There are three real choices, in increasing order of professional acceptability:

Option A: static AK/SK (only on a personal laptop)
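Something like the following — the variable names are the standard ones the provider reads; the values here are placeholders, and `cn-hangzhou` is just an example region:

```shell
# Placeholder values — substitute your sub-account's credentials.
export ALICLOUD_ACCESS_KEY="LTAI0000000000000000"
export ALICLOUD_SECRET_KEY="your-secret-key-here"
export ALICLOUD_REGION="cn-hangzhou"
```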
The provider auto-discovers these env vars. Do not — under any circumstances — write the keys into your `.tf` files. The state file does not store these secrets; a `provider {}` block would, and that block gets committed to git.
If the AK/SK is for a sub-account scoped to only the resources Terraform manages, this is acceptable for a solo project. For anything shared, skip to option B.
Option B: AssumeRole (CI runners)
CI runners shouldn’t carry long-lived AKs. Instead, give the CI runner an AK with one permission only — sts:AssumeRole on a target role — and have Terraform assume that role at apply time:
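A sketch of the provider block. The role ARN, session name, and region are hypothetical — substitute your own; the runner’s AK/SK still arrive via the `ALICLOUD_*` env vars and need nothing beyond `sts:AssumeRole`:

```hcl
provider "alicloud" {
  region = "cn-hangzhou"

  assume_role {
    # Hypothetical role — it holds the actual write permissions.
    role_arn           = "acs:ram::1234567890123456:role/terraform-deployer"
    session_name       = "terraform-ci"
    session_expiration = 3600 # seconds; STS sessions default to one hour
  }
}
```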
The role has the actual write permissions; the AK only has the right to assume it. STS sessions are short-lived (one hour by default), audit-logged in ActionTrail, and can be revoked instantly by detaching the trust policy. This is the model GitLab CI, GitHub Actions, and Jenkins runners should use.
Option C: ECS RAM role (the bastion / IaC service runner)
If terraform apply runs on an Aliyun ECS instance — say, your team’s ops bastion or the Aliyun-hosted IaC Service runner — attach a RAM role to the instance and the provider picks credentials up automatically from instance metadata:
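The provider block shrinks to almost nothing. If the metadata lookup needs a hint, the `ecs_role_name` argument (or the `ALICLOUD_ECS_ROLE_NAME` env var) names the attached role — the role name below is hypothetical:

```hcl
provider "alicloud" {
  region = "cn-hangzhou"

  # Optional: explicitly name the RAM role attached to this ECS instance;
  # credentials are then fetched from instance metadata.
  ecs_role_name = "terraform-runner"
}
```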
Zero secrets in any config, in any env var, in any file. Rotation is automatic. This is the gold standard.
Real-world tip: Whatever you pick, set `ALICLOUD_REGION` (or `provider { region = ... }`) explicitly. If unset, the provider does not pick a default — you get a confusing “Region must be specified” error on `terraform plan` that has tripped me up more than once.
Step 3: state — why local tfstate is a footgun
When you run `terraform apply`, by default Terraform writes `terraform.tfstate` in the current directory. That file is the source of truth for what infrastructure exists. Three things will go wrong:
- Loss. Delete the directory and Terraform thinks nothing exists. The next `apply` tries to recreate everything (or fails on duplicates).
- Conflict. Two engineers running `apply` simultaneously can corrupt the state file.
- Secrets in plaintext. Some resource attributes (database passwords, key material) end up in tfstate. Leaving it on a laptop is bad. Committing it to git is worse — and people do.
The fix is remote state with state locking. On Aliyun, the canonical setup is OSS + Tablestore:

OSS holds the actual `terraform.tfstate` file (with versioning enabled — recovery is one CLI command if the state gets corrupted). Tablestore holds a tiny “lock” row that Terraform writes before any `apply` and deletes after. If a second `apply` starts while the first holds the lock, the second one waits or fails — never both running at once.
Step 4: bootstrap the backend (chicken-and-egg)
The OSS bucket and Tablestore that hold our backend… need to exist before the backend can use them. The honest workflow is to provision them in a tiny one-off bootstrap/ directory using a local state file, then never touch it again.
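A sketch of `bootstrap/main.tf`, using the classic inline bucket schema (on provider versions after the ~1.220 rework, versioning and encryption move to standalone `alicloud_oss_bucket_versioning` / `alicloud_oss_bucket_server_side_encryption` resources). All names are examples:

```hcl
provider "alicloud" {
  region = "cn-hangzhou"
}

# State bucket — versioned so a corrupted tfstate is recoverable,
# encrypted with KMS (referenced again in step 5).
resource "alicloud_oss_bucket" "tfstate" {
  bucket = "my-team-tfstate" # must be globally unique — pick your own

  versioning {
    status = "Enabled"
  }

  server_side_encryption_rule {
    sse_algorithm = "KMS"
  }
}

# Tablestore instance and table that hold Terraform's lock row.
resource "alicloud_ots_instance" "tf_lock" {
  name        = "tf-lock"
  description = "Terraform state locking"
}

resource "alicloud_ots_table" "tf_lock" {
  instance_name = alicloud_ots_instance.tf_lock.name
  table_name    = "terraform_lock"

  # The OSS backend expects a single string primary key named LockID.
  primary_key {
    name = "LockID"
    type = "String"
  }

  time_to_live = -1 # lock rows never expire on their own
  max_version  = 1
}
```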
terraform init && terraform apply from inside bootstrap/. About 30 seconds. Then archive the local tfstate somewhere (I keep it in 1Password as a sanity backup) and never run from this directory again.
Step 5: configure the backend
Back in your real project, add:
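A backend block along these lines — the bucket, Tablestore endpoint, and table names must match whatever you created in `bootstrap/`; the values here are examples:

```hcl
terraform {
  backend "oss" {
    bucket              = "my-team-tfstate" # the bucket from bootstrap/
    prefix              = "agents"          # example project prefix
    key                 = "terraform.tfstate"
    region              = "cn-hangzhou"
    encrypt             = true
    tablestore_endpoint = "https://tf-lock.cn-hangzhou.ots.aliyuncs.com"
    tablestore_table    = "terraform_lock"
  }
}
```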
The prefix lets you stash multiple state files in one bucket — handy when you split your infra into multiple Terraform projects later. encrypt = true enables OSS-side encryption (we already turned on the bucket-level KMS rule, but defense-in-depth never hurts).
Run `terraform init`. Terraform detects the new backend block and, if you already had local state, offers to migrate it into the bucket.
If this fails with “AccessDenied”, your auth role doesn’t have oss:GetObject/PutObject on the bucket. The minimum role policy is:
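A minimal policy sketch, assuming the bucket is named `my-team-tfstate` and the lock table lives in a Tablestore instance called `tf-lock` — substitute your own names, and note the exact action list can vary slightly by backend version:

```json
{
  "Version": "1",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "oss:GetObject",
        "oss:PutObject",
        "oss:ListObjects"
      ],
      "Resource": [
        "acs:oss:*:*:my-team-tfstate",
        "acs:oss:*:*:my-team-tfstate/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "ots:GetRow",
        "ots:PutRow",
        "ots:UpdateRow",
        "ots:DeleteRow"
      ],
      "Resource": "acs:ots:*:*:instance/tf-lock/table/terraform_lock"
    }
  ]
}
```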
Apply this to the role you authenticate with. Don’t grant oss:* — least privilege matters even for backend roles, because that role is in your CI runner.
Step 6: workspaces for env isolation
A workspace is a separate state file inside the same backend. The default workspace is — usefully — called `default`. Create the others you need with `terraform workspace new dev` (and again for `staging` and `prod`), then switch between them with `terraform workspace select <name>`.
Inside HCL, `terraform.workspace` resolves to the current workspace name, which lets you parameterise resource sizes:
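For example — the per-environment counts here are illustrative:

```hcl
locals {
  is_prod = terraform.workspace == "prod"

  # Per-workspace sizing.
  sizes = {
    dev     = 1
    staging = 1
    prod    = 3
  }

  instance_count = local.sizes[terraform.workspace]
}
```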
A clean alternative is one *.tfvars file per env:
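For instance, a `prod.tfvars` — the variable names are hypothetical:

```hcl
# prod.tfvars — select it with: terraform plan -var-file=prod.tfvars
region         = "cn-hangzhou"
vpc_cidr       = "10.10.0.0/16"
instance_count = 3
```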
I use tfvars files for “configuration that obviously differs” (CIDR blocks, region, instance counts) and terraform.workspace only for the conditional is_prod toggle. Mixing both is fine — pick one as the primary mechanism per project.
Step 7: the five-command loop
Day-to-day Terraform is just five commands:

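One common version of that loop looks like this (which five commands you lean on day-to-day varies a little by team):

```shell
terraform init      # once per machine, and after backend/provider changes
terraform fmt       # normalise formatting before committing
terraform validate  # catch syntax and type errors without touching the cloud
terraform plan      # preview the diff — read every line
terraform apply     # execute it
```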
Three rules:
- Always read the plan output before applying. It tells you exactly what’s about to happen — which resources will be created (`+`), updated in-place (`~`), force-replaced (`-/+`), or destroyed (`-`). The `-/+` force-replace markers in particular hide downtime.
- Make `plan` and `apply` two steps in CI. Run `terraform plan -out=tfplan`, post the plan output to the PR, get human approval, then `terraform apply tfplan` on merge. Never auto-apply on push.
- Don’t rush past `state`. `terraform state list` shows everything you currently manage; `terraform state show <addr>` shows one resource’s full attributes. When you’re debugging weird drift, this is where you start.
The eight failure modes you will hit on day one
In the order they happened to me:
1. `Error: Failed to query available provider packages` on `terraform init`. GFW. Set `HTTPS_PROXY` or use the official “Configure an acceleration solution for Terraform initialization” doc — the registry mirror is `https://mirrors.aliyun.com/terraform/`.
2. `Error: state lock`. You hit Ctrl-C during a previous apply and the lock is stale. Run `terraform force-unlock <LOCK_ID>` (the ID is in the error). Verify nothing’s running first.
3. `Error: Region must be specified`. Set the `ALICLOUD_REGION` env var or `region` in the `provider` block.
4. `AccessDenied` on backend init. RAM permissions on the OSS bucket prefix. Re-check step 5’s policy.
5. `InvalidParameter.NotFound` on Tablestore. You bootstrapped the wrong region. The Tablestore endpoint and the OSS bucket region must match.
6. `Provider produced inconsistent result after apply`. Almost always a stale `.terraform/` cache after a provider version bump. `rm -rf .terraform .terraform.lock.hcl && terraform init`.
7. `Resource already exists`. You created the resource by hand in the console. Either delete it or import it: `terraform import alicloud_vpc.main vpc-uf6xxxxxx`.
8. A `terraform plan` diff you didn’t expect on a freshly-applied resource. “Drift”. Either someone touched the resource in the console, or the provider’s read logic differs from create. Look at the specific attributes in the diff; usually the fix is to set the attribute explicitly so Terraform stops “noticing” the difference.
Real-world tip: Run `terraform plan` immediately after every `apply`, even on no changes. The plan should be empty. If it isn’t, you have drift, and the longer you let drift live, the harder it is to reconcile.
What’s next
Article 3 builds the first real piece of infrastructure: a reusable vpc-baseline module. VPC, three vSwitches across three zones, NAT gateway, EIP, security group baseline, KMS key. We will use it in every subsequent article and it is the single most copy-pasted module in my agent stacks.
If this article worked end-to-end for you, you should now be able to run `terraform init`, `terraform workspace select dev`, and `terraform plan` and see “No changes.” That’s the foundation. Everything else stacks on top of it.