Alibaba Cloud on Chen Kai Blog

Alibaba Cloud Full Stack (12): End-to-End — One Terraform Apply for Everything

Sat, 09 May 2026 09:00:00 +0000

Eleven articles. Dozens of CLI commands. Hundreds of manual steps. Now we throw all of that away and rebuild the entire stack with a single terraform apply. This is why infrastructure-as-code exists.

Over the past eleven parts of this series, we have clicked through consoles, typed aliyun CLI commands, and manually configured everything from VPCs to Function Compute triggers. It worked. We learned every resource intimately because we built each one by hand. But if I asked you right now to recreate that entire stack in a new region — the VPC with its three tiers and two availability zones, the ECS instance with its cloud-init script, the RDS MySQL HA setup, the OSS bucket with lifecycle rules, the RAM policies, the SLS log pipeline, the Function Compute event processing — you would need at least a full day of careful work. And you would inevitably miss something. A security group rule. A backup policy. A CORS configuration.

Alibaba Cloud Full Stack (11): PAI — The ML Platform

Fri, 08 May 2026 09:00:00 +0000

Training a model on a single GPU is fun. Deploying it to handle 1,000 requests per second without failing is what separates experiments from products. PAI handles both.

PAI (Platform for AI) is Alibaba Cloud’s managed ML platform. It’s not just one product; it’s five products in a trench coat, sharing a console. These include a notebook environment for exploration, a distributed training service for scale, a model serving platform for production, a visual pipeline designer for those who prefer dragging boxes, and a model gallery for one-click deployment of open-source models. After eighteen months of running real LLM workloads on it, I can say that the individual components range from excellent (EAS) to good enough (Designer). The whole platform is genuinely greater than the sum of its parts once you understand how they connect.

Alibaba Cloud Full Stack (10): Bailian and DashScope — The LLM Layer

Thu, 07 May 2026 09:00:00 +0000

When I first needed an LLM API for a production app in China, my options were limited and expensive. Most international providers had no mainland endpoint, billing required a foreign credit card, and latency from calling US-based APIs was 800ms+ before a single token came back. Then Qwen showed up on DashScope with an OpenAI-compatible endpoint, and suddenly building AI products in China became as straightforward as anywhere else. Same SDK, same request shape, same streaming protocol — just a different base_url and a key from the Bailian console. I have been running production workloads against it for over a year now, and this article is the comprehensive walkthrough I wish I had on day one.

Alibaba Cloud Full Stack (9): OpenSearch and AI Search

Wed, 06 May 2026 09:00:00 +0000

I built my first search engine with Elasticsearch and a pile of synonyms. It took six months to get decent results. Every week, users complained about missing results, so I added more synonyms, broke something else, and added exception rules. The relevance tuning spreadsheet grew to 400 rows. I had custom analyzers for three languages, a boosting config that no one understood (including me), and a reindexing job that took four hours. Then I tried hybrid vector+keyword search on a side project and got better results on day one. Not marginally better — “users stopped complaining” better. That experience completely changed how I think about search, and it’s the reason this article exists.

Alibaba Cloud Full Stack (8): Serverless — Function Compute and EventBridge

Tue, 05 May 2026 09:00:00 +0000

The first time I saw a Function Compute bill that was 0.03 CNY for handling 10,000 requests, I started rethinking my entire architecture. I had been running a 2-vCPU ECS instance 24/7 to serve an API that processed maybe 200 requests per hour, paying around 490 CNY/month. The same workload on Function Compute cost under 5 CNY/month. Not 5 CNY per day — 5 CNY per month. The math was so lopsided that I spent the next weekend migrating everything that did not need a persistent process off ECS and onto functions.

Alibaba Cloud Full Stack (7): SLS, CloudMonitor, and Observability

Mon, 04 May 2026 09:00:00 +0000

The worst production outage I ever caused took three hours to diagnose. A Node.js service was returning 502s intermittently — maybe 5% of requests — and I had nothing. No centralized logs (each ECS instance had its own /var/log/ and I was SSH-ing into them one at a time). No metrics dashboards (I was running top and df -h in terminals). No tracing (I was adding console.log timestamps to try to figure out which downstream call was hanging). Three hours later, I found the issue: a connection pool to RDS was exhausting under load because a forgotten cron job was holding connections open. The fix was two lines of code. The diagnosis took three hours of misery because I had zero observability.

Alibaba Cloud Full Stack (6): RAM, KMS, and Cloud Security

Sun, 03 May 2026 09:00:00 +0000

I once found a DashScope API key hardcoded in a public GitHub repo. It was mine. Someone had forked a demo I pushed months earlier, and the key was sitting in a config file I forgot to gitignore. By the time I noticed, the key had been used to generate 14,000 Qwen API calls in a single weekend. The bill was not catastrophic — DashScope per-token pricing is forgiving — but the lesson was. I had treated cloud security as something I would figure out later. “Later” arrived as a billing alert at 2 AM on a Sunday.

Alibaba Cloud Full Stack (5): RDS and PolarDB — The Database Layer

Sat, 02 May 2026 09:00:00 +0000

My self-managed MySQL on ECS lasted exactly four months before a disk I/O spike during peak traffic brought the whole thing down. The InnoDB buffer pool was fighting the OS page cache for memory, the binary log was filling the system disk faster than my cron job could rotate it, and the single-threaded replication to my “backup” instance was nine hours behind. I fixed it at 3 AM by throwing more disk at it. Then it happened again two weeks later. That is the day I learned why managed databases exist — not because I cannot run MySQL, but because I do not want to be the person paged at 3 AM when MySQL decides the relay log is corrupted and the only fix is to rebuild the replica from a cold backup that may or may not be consistent.

Alibaba Cloud Full Stack (4): OSS — Object Storage Done Right

Fri, 01 May 2026 09:00:00 +0000

I used to store user uploads on the ECS disk. Profile pictures, PDF invoices, CSV exports — all dumped into /var/data/uploads/ on a single ecs.g7.large running my Flask app. I had a cron job that rsynced the directory to a second ECS instance every six hours as a “backup.” Then one Friday at 3am, the system disk hit 100% because a batch job generated 40GB of reports nobody ever downloaded, the instance went read-only, the app crashed, and the rsync hadn’t run since the previous evening. I lost six hours of user uploads and spent the weekend apologizing to customers. That was the week I learned that object storage is not a nice-to-have — it is the foundation of everything you build in the cloud. Your application server is ephemeral. Your data is not.

Alibaba Cloud Full Stack (3): VPC, SLB, and the Network Layer

Thu, 30 Apr 2026 09:00:00 +0000

Every outage I have debugged in the cloud ultimately traced back to networking. Bad CIDR planning that ran out of IPs six months in. Missing routes that silently dropped traffic between tiers. Security groups that were either wide open (hello, port 22 to 0.0.0.0/0) or so locked down that health checks failed and the load balancer kept draining healthy instances. Getting the network layer right is the single most important thing you can do before deploying anything else, and it is the single most painful thing to fix retroactively because changing a VPC CIDR means recreating everything inside it.

Alibaba Cloud Full Stack (2): ECS — Compute That Actually Makes Sense

Wed, 29 Apr 2026 09:00:00 +0000

The first ECS instance I ever launched was wildly over-provisioned. I picked the biggest instance I could find — an ecs.r6.8xlarge with 32 vCPUs and 256 GiB RAM — to run a Flask app that served maybe 20 requests per minute. I burned through credits in a week, panicked, learned how to downsize online, and discovered my app ran perfectly on a 2-vCPU box costing 94% less. Right-sizing matters more than raw power, and understanding the compute layer is the single most useful thing you can learn about any cloud platform.

Alibaba Cloud Full Stack (1): The Ecosystem Map — What Alibaba Cloud Actually Is

Tue, 28 Apr 2026 09:00:00 +0000

I spent my first week on Alibaba Cloud completely lost in a sea of product names. ECS, SLB, SLS, RDS, OSS, NAS, PAI, ARMS, ACK, FC, CDN, WAF, RAM, KMS, ROS, CloudMonitor, EventBridge, PolarDB, Lindorm, AnalyticDB, MaxCompute, DataWorks, Flink, DashScope, Bailian, OpenSearch… Every console page links to three more products I haven’t heard of. The documentation assumes you already know what everything is. The English translations are sometimes literal, sometimes creative, and occasionally missing. This is the guide I wish someone had handed me before I burned my first weekend clicking through consoles and reading translated docs that explained feature flags without ever explaining what the product does.

Terraform for AI Agents (8): End-to-End — research-agent-stack in One Apply

Thu, 26 Mar 2026 09:00:00 +0000

This is where everything from articles 2 through 7 lands in one place. By the end you’ll have run terraform apply once and produced a complete, observable, budgeted agent runtime stack on Alibaba Cloud — about 31 resources, ~7 minutes of wall clock, ¥12,530/month all-in at prod sizing.

The stack we’re building:

Five layers — edge, compute, memory, platform, ops — composed from the modules we built across this series. Eleven Aliyun products under the hood: VPC, ECS, ALB, OSS, RDS for PostgreSQL, OpenSearch, KMS, SLS, ARMS, CloudMonitor, and DashScope (the LLM provider, accessed via the gateway).

Terraform for AI Agents (7): Observability, SLS Dashboards, and Cost Alarms

Tue, 24 Mar 2026 09:00:00 +0000

Agents are non-deterministic, multi-step, and call expensive APIs. This combination means you can’t debug them after the fact unless you instrumented them from the start. This article sets up three pipelines through Terraform — logs, traces, and metrics — into a unified dashboard, adds six SLS queries to solve real incidents, and sets up four alarms that have actually fired and saved my projects in production.

By the end, you’ll have a DingTalk channel that alerts you before the bill explodes, latency increases, the error rate spikes, or an agent starts looping on itself — plus SLO budgets that turn operational feelings into data.

Terraform for AI Agents (6): LLM Gateway and Secrets Management

Sun, 22 Mar 2026 09:00:00 +0000

A pattern I see repeatedly in immature agent stacks: each agent has its own copy of OPENAI_API_KEY in its own .env file. Sometimes the same key, sometimes different ones, sometimes a colleague’s personal key from when they prototyped. When the bill arrives nobody can tell which agent caused which token spend, and when a key leaks (it always does) you’re playing whack-a-mole across a dozen .env files.

The real wake-up call hit me two years ago. A contractor finished his three-month engagement on a Friday, his laptop went home, and on the following Tuesday DashScope billing flagged 12 million tokens of qwen-max traffic from an IP we didn’t recognise. His personal API key — copy-pasted into a side project — was still sitting in our agent’s .env. Rotating it took six hours: three engineers, four repos, two CI pipelines, one panicked Slack thread. Never again.

Terraform for AI Agents (5): Storage — Vector, Relational, and Object Memory

Fri, 20 Mar 2026 09:00:00 +0000

Most tutorials gloss over an agent’s memory. ‘Just put the embeddings in Pinecone, the sessions in Postgres, and the screenshots in S3.’ On Aliyun, all three are managed services. Correctly provisioning them with Terraform can mean the difference between a working memory and losing three weeks of conversation history because the disk filled up at 4 AM.

This article covers all three layers, their Terraform configurations, the critical but tedious backup and disaster recovery (DR) setup, the major version upgrade process, and the Saturday outage that changed how I do things.

Terraform for AI Agents (4): Compute — ECS, ACK, or Function Compute?

Wed, 18 Mar 2026 09:00:00 +0000

The single most important architectural decision in an agent system is where the agent loop process runs. There are three good options on Aliyun, plus a fourth that almost everyone forgets. Picking the wrong one isn’t catastrophic — you can migrate later — but it costs weeks of unnecessary work and several thousand RMB a month in idle compute.

This article covers all four options with working Terraform, cost crossovers, and operational gotchas I often encounter.

Terraform for AI Agents (3): A Reusable VPC and Security Baseline

Mon, 16 Mar 2026 09:00:00 +0000

This article builds the single most copied piece of Terraform in my agent projects: a vpc-baseline module that gives every later component (ECS, RDS, OpenSearch, ACK) a sane place to land. It’s about 200 lines of HCL all-in. Worth typing once, refer to it forever.

By the end you’ll have:

A VPC across three availability zones in one region
Six vSwitches (one public + one private per zone) with non-overlapping CIDRs
A NAT Gateway with EIP for private-subnet outbound to LLM APIs
Three security groups stacked by tier (ALB → agent runtime → memory)
Three KMS customer master keys, one per data domain (memory, secrets, logs)
A clean module interface: name + CIDR + zones in, IDs out
Drift detection in CI, semver-pinned module references, and a per-line cost model

The mental model#

Before code, the picture:

Terraform for AI Agents (2): Provider, Auth, and Remote State on OSS

Sat, 14 Mar 2026 09:00:00 +0000

This article is where you stop reading and start typing. By the end, you’ll have:

The alicloud Terraform provider installed and version-pinned.
Authentication wired up — through the right method, not the convenient one.
Remote state on an OSS bucket with Tablestore-based locking.
Three workspaces (dev, staging, prod) that share a backend but isolate state.
A working terraform plan against an empty config.

Nothing here provisions an agent yet. This lays the foundation for all future articles. If you skip this and try to wing it in article 3, you’ll likely face a state-corruption incident within a week.

Terraform for AI Agents (1): Why IaC Is the Only Sane Way to Ship Agents

Thu, 12 Mar 2026 09:00:00 +0000

I have shipped four agent systems on Alibaba Cloud in the last eighteen months. Three of them started life as a tmux session on a single ECS instance someone created by clicking through the console. All three of those needed a panicked weekend of rebuilding when the second engineer joined the project, when the prod region had a stockout, or when the security team asked for a network diagram.

The fourth started life as terraform apply. It was the only one I haven’t lost a weekend to.