Infrastructure as Code on Chen Kai Blog

Alibaba Cloud Full Stack (12): End-to-End — One Terraform Apply for Everything

Sat, 09 May 2026 09:00:00 +0000

Eleven articles. Dozens of CLI commands. Hundreds of manual steps. Now we throw all of that away and rebuild the entire stack with a single terraform apply. This is why infrastructure-as-code exists.

Over the past eleven parts of this series, we have clicked through consoles, typed aliyun CLI commands, and manually configured everything from VPCs to Function Compute triggers. It worked. We learned every resource intimately because we built each one by hand. But if I asked you right now to recreate that entire stack in a new region — the VPC with its three tiers and two availability zones, the ECS instance with its cloud-init script, the RDS MySQL HA setup, the OSS bucket with lifecycle rules, the RAM policies, the SLS log pipeline, the Function Compute event processing — you would need at least a full day of careful work. And you would inevitably miss something. A security group rule. A backup policy. A CORS configuration.

Terraform for AI Agents (2): Provider, Auth, and Remote State on OSS

Sat, 14 Mar 2026 09:00:00 +0000

This article is where you stop reading and start typing. By the end, you’ll have:

The alicloud Terraform provider installed and version-pinned.
Authentication wired up — through the right method, not the convenient one.
Remote state on an OSS bucket with Tablestore-based locking.
Three workspaces (dev, staging, prod) that share a backend but isolate state.
A working terraform plan against an empty config.
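
As a rough sketch of what the checklist above adds up to, the configuration below pins the alicloud provider and points state at an OSS bucket with a Tablestore lock table. It is not the article's actual code: the bucket name, prefix, region, Tablestore endpoint, table name, and version constraints are all placeholder assumptions. Credentials are left out on purpose, because this excerpt does not say which authentication method the article settles on.

```hcl
terraform {
  # Placeholder constraint: pin to whatever Terraform version you have tested.
  required_version = ">= 1.5.0"

  required_providers {
    alicloud = {
      source  = "aliyun/alicloud"
      # Placeholder constraint: pin to the provider release you have verified.
      version = "~> 1.220"
    }
  }

  # Remote state on OSS; the Tablestore table provides state locking.
  # Every value below is hypothetical and must be replaced with your own.
  backend "oss" {
    bucket              = "example-terraform-state"
    prefix              = "agents"
    key                 = "terraform.tfstate"
    region              = "cn-hangzhou"
    tablestore_endpoint = "https://example-instance.cn-hangzhou.ots.aliyuncs.com"
    tablestore_table    = "terraform_state_lock"
  }
}

# Region is an assumption; no resources are declared, so a plan shows no changes.
provider "alicloud" {
  region = "cn-hangzhou"
}
```

With a backend like this, running terraform workspace new dev (and likewise for staging and prod) after terraform init gives each workspace its own state object in the same bucket.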

Nothing here provisions an agent yet. This lays the foundation for all future articles. If you skip this and try to wing it in article 3, you’ll likely face a state-corruption incident within a week.

Terraform for AI Agents (1): Why IaC Is the Only Sane Way to Ship Agents

Thu, 12 Mar 2026 09:00:00 +0000

I have shipped four agent systems on Alibaba Cloud in the last eighteen months. Three of them started life as a tmux session on a single ECS instance someone created by clicking through the console. All three of those needed a panicked weekend of rebuilding when the second engineer joined the project, when the prod region had a stockout, or when the security team asked for a network diagram.

The fourth started life as terraform apply. It is the only one I haven’t lost a weekend to.

Cloud Computing (7): Cloud Operations and DevOps Practices

Fri, 26 May 2023 09:00:00 +0000

In 2017 GitLab lost six hours of database state. An engineer, exhausted, ran rm -rf on the wrong server during an incident. The backup procedures had silently been broken for months; nobody noticed because no one was restoring from backups. The lesson is not “be careful with rm”. The lesson is that operations is a system — tools, runbooks, monitoring, automation, and the rituals around them. When the system is healthy, no single tired engineer can take down production. When the system is rotten, every late-night fix is one keystroke from disaster.