Cloud Computing on Chen Kai Blog

Alibaba Cloud Full Stack (12): End-to-End — One Terraform Apply for Everything

Sat, 09 May 2026 09:00:00 +0000

Eleven articles. Dozens of CLI commands. Hundreds of manual steps. Now we throw all of that away and rebuild the entire stack with a single terraform apply. This is why infrastructure-as-code exists.

Over the past eleven parts of this series, we have clicked through consoles, typed aliyun CLI commands, and manually configured everything from VPCs to Function Compute triggers. It worked. We learned every resource intimately because we built each one by hand. But if I asked you right now to recreate that entire stack in a new region — the VPC with its three tiers and two availability zones, the ECS instance with its cloud-init script, the RDS MySQL HA setup, the OSS bucket with lifecycle rules, the RAM policies, the SLS log pipeline, the Function Compute event processing — you would need at least a full day of careful work. And you would inevitably miss something. A security group rule. A backup policy. A CORS configuration.

Alibaba Cloud Full Stack (11): PAI — The ML Platform

Fri, 08 May 2026 09:00:00 +0000

Training a model on a single GPU is fun. Deploying it to handle 1,000 requests per second without failing is what separates experiments from products. PAI handles both.

PAI (Platform for AI) is Alibaba Cloud’s managed ML platform. It’s not just one product; it’s five products in a trench coat, sharing a console. These include a notebook environment for exploration, a distributed training service for scale, a model serving platform for production, a visual pipeline designer for those who prefer dragging boxes, and a model gallery for one-click deployment of open-source models. After eighteen months of running real LLM workloads on it, I can say that the individual components range from excellent (EAS) to good enough (Designer). The whole platform is genuinely greater than the sum of its parts once you understand how they connect.

Alibaba Cloud Full Stack (10): Bailian and DashScope — The LLM Layer

Thu, 07 May 2026 09:00:00 +0000

When I first needed an LLM API for a production app in China, my options were limited and expensive. Most international providers had no mainland endpoint, billing required a foreign credit card, and latency from calling US-based APIs was 800ms+ before a single token came back. Then Qwen showed up on DashScope with an OpenAI-compatible endpoint, and suddenly building AI products in China became as straightforward as anywhere else. Same SDK, same request shape, same streaming protocol — just a different base_url and a key from the Bailian console. I have been running production workloads against it for over a year now, and this article is the comprehensive walkthrough I wish I had on day one.

Alibaba Cloud Full Stack (9): OpenSearch and AI Search

Wed, 06 May 2026 09:00:00 +0000

I built my first search engine with Elasticsearch and a pile of synonyms. It took six months to get decent results. Every week, users complained about missing results, so I added more synonyms, broke something else, and added exception rules. The relevance tuning spreadsheet grew to 400 rows. I had custom analyzers for three languages, a boosting config that no one understood (including me), and a reindexing job that took four hours. Then I tried hybrid vector+keyword search on a side project and got better results on day one. Not marginally better — “users stopped complaining” better. That experience completely changed how I think about search, and it’s the reason this article exists.

Alibaba Cloud Full Stack (8): Serverless — Function Compute and EventBridge

Tue, 05 May 2026 09:00:00 +0000

The first time I saw a Function Compute bill that was 0.03 CNY for handling 10,000 requests, I started rethinking my entire architecture. I had been running a 2-vCPU ECS instance 24/7 to serve an API that processed maybe 200 requests per hour, paying around 490 CNY/month. The same workload on Function Compute cost under 5 CNY/month. Not 5 CNY per day — 5 CNY per month. The math was so lopsided that I spent the next weekend migrating everything that did not need a persistent process off ECS and onto functions.

Alibaba Cloud Full Stack (7): SLS, CloudMonitor, and Observability

Mon, 04 May 2026 09:00:00 +0000

The worst production outage I ever caused took three hours to diagnose. A Node.js service was returning 502s intermittently — maybe 5% of requests — and I had nothing. No centralized logs (each ECS instance had its own /var/log/ and I was SSH-ing into them one at a time). No metrics dashboards (I was running top and df -h in terminals). No tracing (I was adding console.log timestamps to try to figure out which downstream call was hanging). Three hours later, I found the issue: a connection pool to RDS was exhausting under load because a forgotten cron job was holding connections open. The fix was two lines of code. The diagnosis took three hours of misery because I had zero observability.

Alibaba Cloud Full Stack (6): RAM, KMS, and Cloud Security

Sun, 03 May 2026 09:00:00 +0000

I once found a DashScope API key hardcoded in a public GitHub repo. It was mine. Someone had forked a demo I pushed months earlier, and the key was sitting in a config file I forgot to gitignore. By the time I noticed, the key had been used to generate 14,000 Qwen API calls in a single weekend. The bill was not catastrophic — DashScope per-token pricing is forgiving — but the lesson was. I had treated cloud security as something I would figure out later. “Later” arrived as a billing alert at 2 AM on a Sunday.

Alibaba Cloud Full Stack (5): RDS and PolarDB — The Database Layer

Sat, 02 May 2026 09:00:00 +0000

My self-managed MySQL on ECS lasted exactly four months before a disk I/O spike during peak traffic brought the whole thing down. The InnoDB buffer pool was fighting the OS page cache for memory, the binary log was filling the system disk faster than my cron job could rotate it, and the single-threaded replication to my “backup” instance was nine hours behind. I fixed it at 3 AM by throwing more disk at it. Then it happened again two weeks later. That is the day I learned why managed databases exist — not because I cannot run MySQL, but because I do not want to be the person paged at 3 AM when MySQL decides the relay log is corrupted and the only fix is to rebuild the replica from a cold backup that may or may not be consistent.

Alibaba Cloud Full Stack (4): OSS — Object Storage Done Right

Fri, 01 May 2026 09:00:00 +0000

I used to store user uploads on the ECS disk. Profile pictures, PDF invoices, CSV exports — all dumped into /var/data/uploads/ on a single ecs.g7.large running my Flask app. I had a cron job that rsynced the directory to a second ECS instance every six hours as a “backup.” Then one Friday at 3am, the system disk hit 100% because a batch job generated 40GB of reports nobody ever downloaded, the instance went read-only, the app crashed, and the rsync hadn’t run since the previous evening. I lost six hours of user uploads and spent the weekend apologizing to customers. That was the week I learned that object storage is not a nice-to-have — it is the foundation of everything you build in the cloud. Your application server is ephemeral. Your data is not.

Alibaba Cloud Full Stack (3): VPC, SLB, and the Network Layer

Thu, 30 Apr 2026 09:00:00 +0000

Every outage I have debugged in the cloud ultimately traced back to networking. Bad CIDR planning that ran out of IPs six months in. Missing routes that silently dropped traffic between tiers. Security groups that were either wide open (hello, port 22 to 0.0.0.0/0) or so locked down that health checks failed and the load balancer kept draining healthy instances. Getting the network layer right is the single most important thing you can do before deploying anything else, and it is the single most painful thing to fix retroactively because changing a VPC CIDR means recreating everything inside it.

Alibaba Cloud Full Stack (2): ECS — Compute That Actually Makes Sense

Wed, 29 Apr 2026 09:00:00 +0000

The first ECS instance I ever launched was wildly over-provisioned. I picked the biggest instance I could find — an ecs.r6.8xlarge with 32 vCPUs and 256 GiB RAM — to run a Flask app that served maybe 20 requests per minute. I burned through credits in a week, panicked, learned how to downsize online, and discovered my app ran perfectly on a 2-vCPU box costing 94% less. Right-sizing matters more than raw power, and understanding the compute layer is the single most useful thing you can learn about any cloud platform.

Alibaba Cloud Full Stack (1): The Ecosystem Map — What Alibaba Cloud Actually Is

Tue, 28 Apr 2026 09:00:00 +0000

I spent my first week on Alibaba Cloud completely lost in a sea of product names. ECS, SLB, SLS, RDS, OSS, NAS, PAI, ARMS, ACK, FC, CDN, WAF, RAM, KMS, ROS, CloudMonitor, EventBridge, PolarDB, Lindorm, AnalyticDB, MaxCompute, DataWorks, Flink, DashScope, Bailian, OpenSearch… Every console page links to three more products I haven’t heard of. The documentation assumes you already know what everything is. The English translations are sometimes literal, sometimes creative, and occasionally missing. This is the guide I wish someone had handed me before I burned my first weekend clicking through consoles and reading translated docs that explained feature flags without ever explaining what the product does.

LAMP Stack on Alibaba Cloud ECS: From Fresh Instance to Production-Ready Web Server

Wed, 28 Jun 2023 09:00:00 +0000

You have a fresh ECS instance and SSH access. Your goal is to run a public website with Apache, PHP, and MySQL. Three common issues often trip up beginners:

Network reachability — packets are silently dropped by the cloud security group, the OS firewall, or the listening socket, and the symptom is always the same: nothing happens.
Service wiring — Apache, PHP, and MySQL are separate processes that need to find each other through file extensions, Unix sockets, and TCP ports. Each interface has its own way to fail.
Identity and permissions — Apache runs as www-data, MySQL runs as mysql, files are owned by root after wget. The wrong combination produces 403, “Access denied”, or chmod 777 desperation.

This guide covers these issues in the order you’ll encounter them on day one and continues with topics that arise later, such as TLS, virtual hosts, backups, source compilation, and when to stop running everything on a single box.

Cloud Computing (8): Multi-Cloud and Hybrid Architecture

Wed, 14 Jun 2023 09:00:00 +0000

The first article in this series asked, “What is the cloud, and why does it matter?” Eight articles later, the question has evolved into something more practical: Which clouds, in what combination, and how do you manage them without losing your mind? Multi-cloud and hybrid architectures are how serious organizations answer that question. They distribute workloads across providers and on-premises infrastructure for resilience, cost optimization, and strategic flexibility — but they introduce a new class of problems that single-cloud architectures never face.

Cloud Computing (7): Cloud Operations and DevOps Practices

Fri, 26 May 2023 09:00:00 +0000

In 2017 GitLab lost six hours of database state. An engineer, exhausted, ran rm -rf on the wrong server during an incident. The backup procedures had silently been broken for months; nobody noticed because no one was restoring from backups. The lesson is not “be careful with rm”. The lesson is that operations is a system — tools, runbooks, monitoring, automation, and the rituals around them. When the system is healthy, no single tired engineer can take down production. When the system is rotten, every late-night fix is one keystroke from disaster.

Cloud Computing (6): Cloud Security and Privacy Protection

Sun, 07 May 2023 09:00:00 +0000

In 2019 Capital One lost a hundred million customer records. The exploit chain was small: a misconfigured WAF allowed server-side request forgery against the EC2 metadata endpoint, that endpoint handed back IAM credentials, and the IAM role those credentials belonged to had wildcard s3:* on every bucket in the account. One misconfiguration, one over-broad role, one rule the security team had not written. The bill, before legal: more than 80 million dollars.

Cloud Computing (5): Cloud Network Architecture and SDN

Tue, 18 Apr 2023 09:00:00 +0000

A cloud platform is essentially a network with attached computers. The compute layer scales by adding servers; the storage layer scales by adding disks; the network layer integrates these into a single, coherent system. Get the network right, and the rest of the stack feels effortless. Get it wrong — a missing route, a 5-tuple mismatch in a security group, or an under-provisioned load balancer — and the whole platform goes dark.

Cloud Computing (4): Cloud Storage Systems and Distributed Architecture

Thu, 30 Mar 2023 09:00:00 +0000

When Netflix stores petabytes of video, when Instagram serves billions of photos, when a quant fund replays a year of market data in minutes — behind every one of these workloads is a distributed storage system. Storage looks deceptively simple from a developer’s window (PUT key, GET key), but the moment you cross the boundary of a single machine, you inherit a stack of problems that has driven decades of research: how to survive disk failures, how to scale linearly, how to provide a consistency model that does not surprise the application, and how to do all of this while paying cents per gigabyte rather than dollars.

Cloud Computing (3): Cloud-Native and Container Technologies

Sat, 11 Mar 2023 09:00:00 +0000

The shift from monolithic applications to cloud-native architectures is one of the most consequential changes in software engineering this decade. The headline — containers and Kubernetes — is well known. The interesting story is why this stack won, what each layer actually does, and where the seams are that determine whether your platform feels effortless or feels like a maze.

Cloud Computing (2): Virtualization Technology Deep Dive

Mon, 20 Feb 2023 09:00:00 +0000

Without virtualization, there is no cloud. Every EC2 instance, every Lambda invocation, every Kubernetes pod ultimately stands on the same trick: lying convincingly to an operating system about the hardware underneath it. This article walks the full stack — from the CPU instructions that make the trick cheap, through the four hypervisors that dominate the market, to the production-grade tuning knobs that decide whether your VMs run at 70 % or 99 % of bare metal.

Cloud Computing (1): Fundamentals and Architecture

Wed, 01 Feb 2023 09:00:00 +0000

Every team building software in 2025 inherits the same buy-or-rent question their predecessors faced — only the answer has flipped. Twenty years ago you put hardware in a closet; today you describe the hardware in YAML and a global provider conjures it up in seconds, bills it by the second, and tears it down when you stop paying. Cloud computing is not just “someone else’s computer”. It is a programmable, metered, multi-tenant abstraction over compute, storage and networking that has fundamentally changed how businesses are built and how engineers spend their day.