Aliyun-Fullstack on Chen Kai Blog

Alibaba Cloud Full Stack (12): End-to-End — One Terraform Apply for Everything

Sat, 09 May 2026 09:00:00 +0000

Eleven articles. Dozens of CLI commands. Hundreds of manual steps. Now we throw all of that away and rebuild the entire stack with a single terraform apply. This is why infrastructure-as-code exists.

Over the past eleven parts of this series, we have clicked through consoles, typed aliyun CLI commands, and manually configured everything from VPCs to Function Compute triggers. It worked. We learned every resource intimately because we built each one by hand. But if I asked you right now to recreate that entire stack in a new region — the VPC with its three tiers and two availability zones, the ECS instance with its cloud-init script, the RDS MySQL HA setup, the OSS bucket with lifecycle rules, the RAM policies, the SLS log pipeline, the Function Compute event processing — you would need at least a full day of careful work. And you would inevitably miss something. A security group rule. A backup policy. A CORS configuration.

Alibaba Cloud Full Stack (11): PAI — The ML Platform

Fri, 08 May 2026 09:00:00 +0000

Training a model on a single GPU is fun. Deploying it to handle 1,000 requests per second without failing is what separates experiments from products. PAI handles both.

PAI (Platform for AI) is Alibaba Cloud’s managed ML platform. It’s not just one product; it’s five products in a trench coat, sharing a console. These include a notebook environment for exploration, a distributed training service for scale, a model serving platform for production, a visual pipeline designer for those who prefer dragging boxes, and a model gallery for one-click deployment of open-source models. After eighteen months of running real LLM workloads on it, I can say that the individual components range from excellent (EAS) to good enough (Designer). The whole platform is genuinely greater than the sum of its parts once you understand how they connect.

Alibaba Cloud Full Stack (10): Bailian and DashScope — The LLM Layer

Thu, 07 May 2026 09:00:00 +0000

When I first needed an LLM API for a production app in China, my options were limited and expensive. Most international providers had no mainland endpoint, billing required a foreign credit card, and latency from calling US-based APIs was 800ms+ before a single token came back. Then Qwen showed up on DashScope with an OpenAI-compatible endpoint, and suddenly building AI products in China became as straightforward as anywhere else. Same SDK, same request shape, same streaming protocol — just a different base_url and a key from the Bailian console. I have been running production workloads against it for over a year now, and this article is the comprehensive walkthrough I wish I had on day one.

Alibaba Cloud Full Stack (9): OpenSearch and AI Search

Wed, 06 May 2026 09:00:00 +0000

I built my first search engine with Elasticsearch and a pile of synonyms. It took six months to get decent results. Every week, users complained about missing results, so I added more synonyms, broke something else, and added exception rules. The relevance tuning spreadsheet grew to 400 rows. I had custom analyzers for three languages, a boosting config that no one understood (including me), and a reindexing job that took four hours. Then I tried hybrid vector+keyword search on a side project and got better results on day one. Not marginally better — “users stopped complaining” better. That experience completely changed how I think about search, and it’s the reason this article exists.

Alibaba Cloud Full Stack (8): Serverless — Function Compute and EventBridge

Tue, 05 May 2026 09:00:00 +0000

The first time I saw a Function Compute bill that was 0.03 CNY for handling 10,000 requests, I started rethinking my entire architecture. I had been running a 2-vCPU ECS instance 24/7 to serve an API that processed maybe 200 requests per hour, paying around 490 CNY/month. The same workload on Function Compute cost under 5 CNY/month. Not 5 CNY per day — 5 CNY per month. The math was so lopsided that I spent the next weekend migrating everything that did not need a persistent process off ECS and onto functions.

Alibaba Cloud Full Stack (7): SLS, CloudMonitor, and Observability

Mon, 04 May 2026 09:00:00 +0000

The worst production outage I ever caused took three hours to diagnose. A Node.js service was returning 502s intermittently — maybe 5% of requests — and I had nothing. No centralized logs (each ECS instance had its own /var/log/ and I was SSH-ing into them one at a time). No metrics dashboards (I was running top and df -h in terminals). No tracing (I was adding console.log timestamps to try to figure out which downstream call was hanging). Three hours later, I found the issue: a connection pool to RDS was exhausting under load because a forgotten cron job was holding connections open. The fix was two lines of code. The diagnosis took three hours of misery because I had zero observability.

Alibaba Cloud Full Stack (6): RAM, KMS, and Cloud Security

Sun, 03 May 2026 09:00:00 +0000

I once found a DashScope API key hardcoded in a public GitHub repo. It was mine. Someone had forked a demo I pushed months earlier, and the key was sitting in a config file I forgot to gitignore. By the time I noticed, the key had been used to generate 14,000 Qwen API calls in a single weekend. The bill was not catastrophic — DashScope per-token pricing is forgiving — but the lesson was. I had treated cloud security as something I would figure out later. “Later” arrived as a billing alert at 2 AM on a Sunday.

Alibaba Cloud Full Stack (5): RDS and PolarDB — The Database Layer

Sat, 02 May 2026 09:00:00 +0000

My self-managed MySQL on ECS lasted exactly four months before a disk I/O spike during peak traffic brought the whole thing down. The InnoDB buffer pool was fighting the OS page cache for memory, the binary log was filling the system disk faster than my cron job could rotate it, and the single-threaded replication to my “backup” instance was nine hours behind. I fixed it at 3 AM by throwing more disk at it. Then it happened again two weeks later. That is the day I learned why managed databases exist — not because I cannot run MySQL, but because I do not want to be the person paged at 3 AM when MySQL decides the relay log is corrupted and the only fix is to rebuild the replica from a cold backup that may or may not be consistent.

Alibaba Cloud Full Stack (4): OSS — Object Storage Done Right

Fri, 01 May 2026 09:00:00 +0000

I used to store user uploads on the ECS disk. Profile pictures, PDF invoices, CSV exports — all dumped into /var/data/uploads/ on a single ecs.g7.large running my Flask app. I had a cron job that rsynced the directory to a second ECS instance every six hours as a “backup.” Then one Friday at 3am, the system disk hit 100% because a batch job generated 40GB of reports nobody ever downloaded, the instance went read-only, the app crashed, and the rsync hadn’t run since the previous evening. I lost six hours of user uploads and spent the weekend apologizing to customers. That was the week I learned that object storage is not a nice-to-have — it is the foundation of everything you build in the cloud. Your application server is ephemeral. Your data is not.

Alibaba Cloud Full Stack (3): VPC, SLB, and the Network Layer

Thu, 30 Apr 2026 09:00:00 +0000

Every outage I have debugged in the cloud ultimately traced back to networking. Bad CIDR planning that ran out of IPs six months in. Missing routes that silently dropped traffic between tiers. Security groups that were either wide open (hello, port 22 to 0.0.0.0/0) or so locked down that health checks failed and the load balancer kept draining healthy instances. Getting the network layer right is the single most important thing you can do before deploying anything else, and it is the single most painful thing to fix retroactively because changing a VPC CIDR means recreating everything inside it.

Alibaba Cloud Full Stack (2): ECS — Compute That Actually Makes Sense

Wed, 29 Apr 2026 09:00:00 +0000

The first ECS instance I ever launched was wildly over-provisioned. I picked the biggest instance I could find — an ecs.r6.8xlarge with 32 vCPUs and 256 GiB RAM — to run a Flask app that served maybe 20 requests per minute. I burned through credits in a week, panicked, learned how to downsize online, and discovered my app ran perfectly on a 2-vCPU box costing 94% less. Right-sizing matters more than raw power, and understanding the compute layer is the single most useful thing you can learn about any cloud platform.

Alibaba Cloud Full Stack (1): The Ecosystem Map — What Alibaba Cloud Actually Is

Tue, 28 Apr 2026 09:00:00 +0000

I spent my first week on Alibaba Cloud completely lost in a sea of product names. ECS, SLB, SLS, RDS, OSS, NAS, PAI, ARMS, ACK, FC, CDN, WAF, RAM, KMS, ROS, CloudMonitor, EventBridge, PolarDB, Lindorm, AnalyticDB, MaxCompute, DataWorks, Flink, DashScope, Bailian, OpenSearch… Every console page links to three more products I haven’t heard of. The documentation assumes you already know what everything is. The English translations are sometimes literal, sometimes creative, and occasionally missing. This is the guide I wish someone had handed me before I burned my first weekend clicking through consoles and reading translated docs that explained feature flags without ever explaining what the product does.