
Alibaba Cloud Full Stack (2): ECS — Compute That Actually Makes Sense
Everything you need to know about ECS: instance families (g8, c8, r8, GPU), pricing models, cloud-init automation, security groups, and key pairs. We deploy a production-ready app server from scratch.
The first ECS instance I ever launched was wildly over-provisioned. I picked the biggest instance I could find — an ecs.r6.8xlarge with 32 vCPUs and 256 GiB RAM — to run a Flask app that served maybe 20 requests per minute. I burned through credits in a week, panicked, learned how to downsize online, and discovered my app ran perfectly on a 2-vCPU box costing 94% less. Right-sizing matters more than raw power, and understanding the compute layer is the single most useful thing you can learn about any cloud platform.
This article is the complete guide to Elastic Compute Service. We start from what ECS actually is, move through instance families and pricing models, then build a production-ready app server from scratch using the CLI. By the end, you will have enough working knowledge to provision, secure, and operate ECS instances for real workloads.
What ECS actually is#
Elastic Compute Service is Alibaba Cloud’s virtual machine product. If you have used AWS EC2, Azure VMs, or GCP Compute Engine, ECS is the direct equivalent. You get a virtual server running Linux or Windows, connected to a virtual network, with block storage attached. You control it over SSH or RDP, and you pay by the hour or by the month.
But ECS is not just “a VM.” It is a composition of six building blocks, and understanding each one separately saves a lot of confusion:
| Component | What it is | AWS equivalent |
|---|---|---|
| Instance | The virtual machine itself — vCPUs, RAM, local NVMe | EC2 instance |
| Image | The OS template used to boot the instance | AMI |
| Block Storage (disk) | Network-attached persistent storage — system disk + data disks | EBS |
| Security Group | Stateful firewall rules attached to the instance’s network interface | Security Group |
| VPC / VSwitch | The virtual network and subnet the instance lives in | VPC / Subnet |
| ENI | Elastic Network Interface — the virtual NIC | ENI |
When you “create an ECS instance” in the console, you are actually configuring all six at once. The console bundles them to reduce friction, but they are separate resources with independent lifecycles. You can detach a disk from one instance and attach it to another. You can move an ENI between instances. Security groups are shared across instances. Understanding this decomposition is what separates someone who uses ECS from someone who operates it.
The ECS lifecycle#
Every instance passes through a well-defined state machine:
The states that matter in practice:
- Stopped: No compute charges on pay-as-you-go, provided the instance is stopped in economical mode (`StoppedMode=StopCharging`); disk and IP charges continue. This is the state you want for dev instances overnight.
- Running: The instance is up and billing. CPU/memory/network metering is active.
- Stopping: Brief transition. On a `ForceStop` this can take up to 60 seconds. Graceful stop sends ACPI shutdown to the OS and waits.
- Released: Gone. The instance, its system disk, and its local disks are permanently deleted. Data disks survive only if you configured them with `DeleteWithInstance = false`.
One thing that trips people up: a Stopped instance still holds its private IP and its elastic IP association. You are not releasing network resources by stopping. If you want to actually free everything, you release the instance.
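The lifecycle can be sketched as a small transition table. This is a simplified model, not the platform's actual state machine: the `Pending` and `Starting` states and the exact set of allowed transitions are assumptions added for illustration, and `Released` is treated as terminal.

```python
# Simplified model of the ECS instance lifecycle described above.
# "Pending" and "Starting" are assumptions added to make the graph complete.
TRANSITIONS = {
    "Pending": {"Stopped"},              # a freshly created instance starts out Stopped
    "Stopped": {"Starting", "Released"},
    "Starting": {"Running"},
    "Running": {"Stopping", "Released"},
    "Stopping": {"Stopped"},
    "Released": set(),                   # terminal: nothing comes back from here
}


def can_transition(current: str, target: str) -> bool:
    """Return True if the model allows moving directly from current to target."""
    return target in TRANSITIONS.get(current, set())


def path_is_valid(states: list[str]) -> bool:
    """Check that every consecutive pair in a sequence of states is allowed."""
    return all(can_transition(a, b) for a, b in zip(states, states[1:]))


print(path_is_valid(["Pending", "Stopped", "Starting", "Running", "Stopping", "Stopped"]))
print(can_transition("Released", "Running"))
```

A table like this is handy in automation scripts: before calling a start or stop operation, check that the instance's current state actually permits it.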
ECS vs EC2: what’s different#
If you’re coming from AWS, these are the meaningful differences:
- No instance store volumes in the EC2 sense. ECS local NVMe disks exist on specific instance families (i-series), but most workloads use cloud disks exclusively.
- Security groups are VPC-scoped, not region-scoped. You cannot share a security group across VPCs.
- Metadata endpoint is `http://100.100.100.200/latest/meta-data/` instead of `169.254.169.254`. Cloud-init works the same way, but if you’re porting scripts, update the URL.
- Chinese regions have separate infrastructure. `cn-hangzhou` and `us-east-1` are not just different regions; they are completely independent control planes with separate accounts and billing.
Instance family deep dive#
Instance families are the core abstraction for hardware specialization. The naming convention is:

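`ecs.<family><generation><suffix>.<size>`, for example `ecs.g8y.xlarge`: family `g`, generation `8`, suffix `y`, size `xlarge`.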
The suffix letter after the generation number indicates the processor:
- No suffix or `i` = Intel Xeon
- `a` = AMD EPYC
- `y` = Alibaba Yitian 710 (ARM)
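As an illustration, the naming convention can be parsed with a regular expression. This sketch assumes names of the simple form `ecs.<family><generation><suffix>.<size>`; catalog names with extra segments (hyphenated variants, for example) are not handled, and `parse_instance_type` is a hypothetical helper, not part of any SDK.

```python
import re

# Suffix-to-processor mapping taken from the list above; other suffixes stay unresolved.
PROCESSORS = {
    "": "Intel Xeon",
    "i": "Intel Xeon",
    "a": "AMD EPYC",
    "y": "Alibaba Yitian 710 (ARM)",
}

PATTERN = re.compile(
    r"^ecs\.(?P<family>[a-z]+)(?P<generation>\d+)(?P<suffix>[a-z]*)\.(?P<size>[a-z0-9]+)$"
)


def parse_instance_type(name: str) -> dict:
    """Split an instance type name into family, generation, suffix, size, and processor."""
    match = PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognized instance type: {name!r}")
    parts = match.groupdict()
    parts["generation"] = int(parts["generation"])
    parts["processor"] = PROCESSORS.get(parts["suffix"], "unknown")
    return parts


print(parse_instance_type("ecs.g8y.xlarge"))
print(parse_instance_type("ecs.c7.large"))
```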
Here is the instance family reference you will actually use:
| Family | Type | vCPU : Memory | Processor | Network (Gbps) | Best for |
|---|---|---|---|---|---|
| g7 | General Purpose | 1:4 | Intel Xeon (Ice Lake) | Up to 25 | Web servers, mid-tier APIs |
| g8i | General Purpose | 1:4 | Intel Xeon (Sapphire Rapids) | Up to 40 | General workloads, latest gen |
| g8y | General Purpose | 1:4 | Yitian 710 (ARM) | Up to 40 | Cost-efficient ARM workloads |
| c7 | Compute Optimized | 1:2 | Intel Xeon (Ice Lake) | Up to 25 | High-CPU: encoding, CI/CD |
| c8i | Compute Optimized | 1:2 | Intel Xeon (Sapphire Rapids) | Up to 40 | Batch processing, game servers |
| c8y | Compute Optimized | 1:2 | Yitian 710 (ARM) | Up to 40 | ARM CI/CD, build farms |
| r7 | Memory Optimized | 1:8 | Intel Xeon (Ice Lake) | Up to 25 | Databases, in-memory caches |
| r8i | Memory Optimized | 1:8 | Intel Xeon (Sapphire Rapids) | Up to 40 | Redis, Elasticsearch |
| r8y | Memory Optimized | 1:8 | Yitian 710 (ARM) | Up to 40 | Cost-efficient memory workloads |
| gn7i | GPU | Varies | A10 (24 GB) | Up to 32 | ML inference, fine-tuning |
| gn7 | GPU | Varies | A100 (40 GB) | Up to 25 | ML training, HPC |
| gn6v | GPU | Varies | V100 (16 GB) | Up to 5 | Budget ML, dev inference |
| t6 | Burstable | 1:1/1:2/1:4 | Intel Xeon | Up to 1.2 | Dev/test, micro workloads |
| ebmg7 | Bare Metal | 1:4 | Intel Xeon (Ice Lake) | Up to 65 | High-performance, no hypervisor |
| ebmc7 | Bare Metal | 1:2 | Intel Xeon (Ice Lake) | Up to 65 | Dedicated hardware, compliance |
Sizes within a family#
Each family offers sizes that double resources at each step:
| Size | vCPU | Memory (g-family, 1:4 ratio) |
|---|---|---|
| small | 1 | 4 GiB |
| large | 2 | 8 GiB |
| xlarge | 4 | 16 GiB |
| 2xlarge | 8 | 32 GiB |
| 4xlarge | 16 | 64 GiB |
| 8xlarge | 32 | 128 GiB |
| 16xlarge | 64 | 256 GiB |
Practical note on Yitian 710 (ARM): The `*y` families are 20-30% cheaper than their Intel equivalents for the same specs. If your workload runs on Linux and you are not dependent on x86-specific binaries, always try the ARM variant first. Most Python/Node/Go/Java workloads just work. Docker images need to be multi-arch (linux/arm64), which is usually a one-flag change in your image build (for example `docker buildx build --platform linux/amd64,linux/arm64`).
Burstable instances: the trap and the fix#
The t6 family deserves special mention because it trips up almost everyone. Burstable instances accumulate CPU credits when idle and spend them when busy. Once you run out of credits, your CPU is throttled to a baseline — typically 10-20% of a vCPU.
This is perfect for a dev/test box that sits idle most of the day and occasionally runs a build. It is terrible for any workload with sustained CPU usage. I have personally seen production databases on t6.large instances run fine for weeks, then suddenly crater during a traffic spike because the credit balance hit zero.
The rule: if your average CPU exceeds the baseline (check the product page for your specific size), you need a non-burstable instance. Period.
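To see why sustained load is the problem, here is a deliberately simplified credit model. It assumes a 20% baseline, one-minute steps, and credits measured in vCPU-minutes; Alibaba Cloud's real accounting (credit units, caps, launch credits) differs, so treat the numbers as illustrative only.

```python
def simulate_credits(demand, baseline=0.2, start_balance=0.0, max_balance=100.0):
    """Simulate a burstable instance minute by minute.

    demand: requested CPU utilization per minute (0.0-1.0 of one vCPU).
    Returns a list of (delivered_utilization, balance_after) tuples.
    """
    balance = start_balance
    timeline = []
    for wanted in demand:
        if wanted <= baseline:
            # Below baseline: the unused share is banked as credit.
            balance = min(max_balance, balance + (baseline - wanted))
            delivered = wanted
        else:
            # Above baseline: burst only as far as the banked credit allows.
            extra = min(wanted - baseline, balance)
            balance -= extra
            delivered = baseline + extra
        timeline.append((round(delivered, 4), round(balance, 4)))
    return timeline


# Ten idle minutes followed by five minutes of full load.
for minute, (cpu, credits) in enumerate(simulate_credits([0.0] * 10 + [1.0] * 5), start=1):
    print(f"minute {minute:2d}: delivered {cpu:.2f} vCPU, balance {credits:.2f}")
```

In this run the instance bursts at full speed for two minutes, delivers a partial burst in the third, and then sits at the baseline. That is the "sudden crater" pattern described above.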
Choosing the right size#
Decision-making here is simpler than it looks. Start with the workload, not the instance:

| Workload | Recommended start | Why |
|---|---|---|
| Static site / reverse proxy | ecs.c7.large (2 vCPU, 4 GiB) | CPU-bound, barely needs memory |
| REST API backend (Node/Python) | ecs.g7.xlarge (4 vCPU, 16 GiB) | Balanced — some CPU for JSON serialization, memory for connection pools |
| PostgreSQL / MySQL | ecs.r7.2xlarge (8 vCPU, 64 GiB) | Memory-heavy for buffer pool. Consider RDS instead. |
| Redis / Memcached | ecs.r7.xlarge (4 vCPU, 32 GiB) | All about memory. Again, consider managed Redis. |
| CI/CD runner | ecs.c8y.xlarge (4 vCPU, 8 GiB) | CPU-bound compilation. ARM is fine for most builds. |
| ML inference (LLM) | ecs.gn7i.xlarge (4 vCPU, 1x A10) | GPU for matrix ops, moderate CPU for pre/post processing |
| ML training | ecs.gn7.8xlarge (32 vCPU, 8x A100) | Multi-GPU for distributed training |
| Dev/test throwaway | ecs.t6.large (2 vCPU, 4 GiB) | Cheap, burstable, stop it at night |
The golden rule: start small, monitor for one week, then resize. ECS supports in-place instance type changes for most families: stop the instance, change the type, start it again. The whole process takes under two minutes. Over-provisioning from day one is burning money on speculation.
Pricing models explained#
ECS offers four ways to pay, and choosing the right one can cut your bill by 80% or more. Here they are, starting with the most expensive per hour:

Pay-as-you-go (PAYG)#
Billed per second, minimum one-minute granularity. When you stop the instance, compute charges stop (disk and IP continue). This is the default and the most expensive per-hour, but it has zero commitment and zero waste — you pay only for what you use.
Best for: dev/test, spiky workloads, instances that run a few hours per day.
Subscription (prepaid)#
Commit to 1 month, 3 months, 6 months, 1 year, 2 years, or 3 years. Discounts range from ~15% (1 month) to ~50% (3 years) compared to PAYG. You pay upfront. The instance runs whether you use it or not.
Best for: production workloads with predictable, steady utilization.
Preemptible instances (spot)#
Same hardware as PAYG, but prices float based on supply and demand — typically 70-90% cheaper. The catch: Alibaba Cloud can reclaim your instance with a 5-minute warning when demand spikes. You get interrupted.
Best for: stateless batch processing, CI/CD, distributed ML training (with checkpointing), any workload that can handle interruption.
Savings plans and reserved instances#
Savings Plans let you commit to a spending amount (e.g., 100 CNY/hour) across any instance family in a region, at 30-60% discount. Reserved Instances are similar but locked to a specific instance type and AZ.
Best for: large, well-understood fleets where you know your baseline usage.
Pricing comparison#
Here is a real cost comparison for ecs.c7.large (2 vCPU, 4 GiB) in cn-beijing, running 24/7 for one month (prices approximate, check current rates):
| Model | Hourly (CNY) | Monthly (CNY) | Savings vs PAYG |
|---|---|---|---|
| Pay-as-you-go | 0.68 | ~490 | — |
| Subscription (1 month) | — | ~415 | 15% |
| Subscription (1 year) | — | ~310 | 37% |
| Subscription (3 years) | — | ~245 | 50% |
| Preemptible (avg) | 0.10 | ~72 | 85% |
| Savings Plan (1 year) | — | ~295 | 40% |
The preemptible price is not a typo. If your workload tolerates interruption, spot instances are almost free. I run all CI/CD on spot and have been interrupted maybe three times in a year — always during major Chinese holiday shopping events when demand peaks.
The hybrid strategy I actually use: Production runs on Subscription (1-year). Dev/test runs on PAYG with auto-stop scripts at midnight. Batch jobs run on Preemptible with a fallback to PAYG if spot capacity is unavailable. This cuts the overall bill by about 45% compared to all-PAYG.
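The arithmetic behind the comparison table is easy to reproduce. The sketch below uses only the approximate prices quoted above and assumes a 30-day month; the break-even helper is an addition for illustration.

```python
HOURS_PER_MONTH = 24 * 30  # the comparison assumes a 30-day month

PAYG_HOURLY = 0.68  # CNY per hour, from the table above
MONTHLY = {
    "Pay-as-you-go": PAYG_HOURLY * HOURS_PER_MONTH,
    "Subscription (1 month)": 415,
    "Subscription (1 year)": 310,
    "Subscription (3 years)": 245,
    "Preemptible (avg)": 0.10 * HOURS_PER_MONTH,
    "Savings Plan (1 year)": 295,
}


def savings_vs_payg(monthly_cost: float) -> float:
    """Percentage saved relative to running pay-as-you-go 24/7."""
    return (1 - monthly_cost / MONTHLY["Pay-as-you-go"]) * 100


def break_even_hours(subscription_monthly: float, hourly: float = PAYG_HOURLY) -> float:
    """Hours per month above which the subscription is cheaper than PAYG."""
    return subscription_monthly / hourly


for model, cost in MONTHLY.items():
    print(f"{model:24s} {cost:7.1f} CNY  {savings_vs_payg(cost):5.1f}% saved")
print(f"1-year subscription breaks even at {break_even_hours(310):.0f} h/month")
```

Under these numbers, an instance that runs fewer than about 456 hours a month (roughly 15 hours a day) is cheaper on PAYG than on a 1-year subscription, which is why dev/test boxes with auto-stop stay on PAYG.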
Creating an ECS instance step by step#
Prerequisites#
Before you create an instance, you need:
- A VPC and VSwitch in your target region. We cover VPC setup in detail in Part 3, but for this walkthrough, I will assume you have a VPC with a VSwitch in `cn-beijing-h`.
- A security group in that VPC (we create one below).
- A key pair for SSH access (we create one below).
- The Alibaba Cloud CLI (`aliyun`) installed and configured. Run `aliyun configure` if you haven’t already.
Console walkthrough (quick version)#
The console path is: ECS Console > Instances > Create Instance.
- Billing: Select Pay-As-You-Go for now.
- Region: `China (Beijing)`, Zone H.
- Instance Type: Search for `ecs.c7.large`. Select it.
- Image: Alibaba Cloud Linux 3.2104 LTS 64-bit (the default). This is CentOS-compatible and free.
- Storage: System disk = 40 GiB ESSD PL0. No data disk for now.
- Networking: Select your VPC, your VSwitch in zone H. Assign a public IP (1 Mbps is fine for SSH; use an SLB/ALB for production traffic).
- Security Group: Select or create one that allows TCP 22 from your IP.
- Login: Key Pair (not password).
- Advanced: Paste your cloud-init script in User Data (we write one below).
- Create.
That is seven clicks and three dropdowns. But if you are going to create more than one instance, or if you want reproducibility, use the CLI.
CLI walkthrough (the real way)#
First, let’s create the supporting resources. If you already have a VPC and security group, skip ahead to the instance creation.
Create a security group:
Save the SecurityGroupId from the response — you will need it.
Create a key pair:
Create the ECS instance:
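One way to keep the instance creation call reproducible is to assemble it in a script. The sketch below builds an `aliyun ecs CreateInstance` command line in Python without executing it. The parameter names follow the ECS `CreateInstance` API as I understand it, so check them against `aliyun ecs CreateInstance --help` before relying on them. The resource IDs (`m-xxx`, `vsw-xxx`, `sg-xxx`) and the key pair name are placeholders.

```python
import base64
import shlex


def create_instance_command(image_id, vswitch_id, security_group_id, key_pair, user_data=""):
    """Assemble an `aliyun ecs CreateInstance` call without executing it."""
    args = [
        "aliyun", "ecs", "CreateInstance",
        "--RegionId", "cn-beijing",
        "--ZoneId", "cn-beijing-h",
        "--InstanceType", "ecs.c7.large",
        "--ImageId", image_id,
        "--VSwitchId", vswitch_id,
        "--SecurityGroupId", security_group_id,
        "--KeyPairName", key_pair,
        "--InstanceChargeType", "PostPaid",      # pay-as-you-go
        "--SystemDisk.Category", "cloud_essd",
        "--SystemDisk.PerformanceLevel", "PL0",
        "--SystemDisk.Size", "40",
        "--InternetMaxBandwidthOut", "1",
    ]
    if user_data:
        # UserData must be passed base64-encoded.
        args += ["--UserData", base64.b64encode(user_data.encode()).decode()]
    return args


# Placeholder IDs: substitute the values returned by your own API calls.
command = create_instance_command("m-xxx", "vsw-xxx", "sg-xxx", "my-key")
print(shlex.join(command))
```

Passing the resulting list to `subprocess.run` would execute it; printing it first makes the call easy to review.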
Note: --ImageId changes as new images are released. To find the latest Alibaba Cloud Linux 3 image:
After creation, the instance is in Stopped state. Start it:
Wait about 30 seconds, then SSH in:
Cloud-init: automate everything from boot#
Nobody should SSH into a fresh instance and manually install packages. Cloud-init runs on first boot and configures the instance automatically. Every ECS image ships with cloud-init pre-installed.

You pass your cloud-init configuration as UserData when creating the instance. It must be base64-encoded. Here is a comprehensive cloud-init.yaml that sets up a production-ready app server:
After the instance boots, cloud-init processes this file in stages: write files and create users, set the timezone, install packages, then run commands. The whole process takes 2-3 minutes on an ecs.c7.large.
To verify cloud-init completed successfully:
Debugging tip: The most common cloud-init failure is a YAML indentation error. Validate your config locally with `cloud-init schema --config-file cloud-init.yaml` before base64-encoding it. Also, `UserData` has a 16 KiB limit; if your script is complex, have cloud-init pull a script from OSS instead.
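The encoding step and the size check can be sketched in a few lines. The sample cloud-config is a small hypothetical stand-in, not the full file described above, and the code assumes the 16 KiB limit applies to the raw bytes before encoding; verify that against the current ECS documentation.

```python
import base64

USER_DATA_LIMIT = 16 * 1024  # bytes; the 16 KiB figure quoted above

# Hypothetical minimal cloud-config, for illustration only.
SAMPLE = """#cloud-config
timezone: Asia/Shanghai
packages:
  - nginx
runcmd:
  - systemctl enable --now nginx
"""


def encode_user_data(cloud_config: str, limit: int = USER_DATA_LIMIT) -> str:
    """Base64-encode a cloud-config document, refusing malformed or oversized input."""
    if not cloud_config.startswith("#cloud-config"):
        raise ValueError("cloud-config must start with the '#cloud-config' header line")
    raw = cloud_config.encode("utf-8")
    if len(raw) > limit:
        raise ValueError(f"user data is {len(raw)} bytes, limit is {limit}")
    return base64.b64encode(raw).decode("ascii")


encoded = encode_user_data(SAMPLE)
print(f"{len(SAMPLE.encode())} raw bytes -> {len(encoded)} base64 characters")
```

Failing early in your own tooling is cheaper than discovering a rejected or truncated script after the instance has booted.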
Security groups: your first firewall#
A security group is a stateful firewall at the ENI level. Every packet entering or leaving an ECS instance is evaluated against its security group rules. For inbound traffic, if no rule matches, the packet is dropped: default deny. (Basic security groups allow all outbound traffic by default.)

“Stateful” means that if you allow inbound TCP 80, the response packets are automatically allowed out. You do not need a matching outbound rule for return traffic.
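Default deny for inbound traffic is easy to model. The sketch below is a toy evaluator, not how ECS implements security groups: it ignores rule priorities, explicit drop policies, and connection tracking. The office CIDR `203.0.113.0/24` is a documentation-range placeholder.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class Rule:
    protocol: str      # "tcp" or "udp"
    port_min: int
    port_max: int
    source_cidr: str


def inbound_allowed(rules, protocol, port, source_ip):
    """Default deny: a packet passes only if some rule explicitly matches it."""
    src = ip_address(source_ip)
    return any(
        rule.protocol == protocol
        and rule.port_min <= port <= rule.port_max
        and src in ip_network(rule.source_cidr)
        for rule in rules
    )


WEB_RULES = [
    Rule("tcp", 80, 80, "0.0.0.0/0"),
    Rule("tcp", 443, 443, "0.0.0.0/0"),
    Rule("tcp", 22, 22, "203.0.113.0/24"),  # hypothetical office CIDR
]

print(inbound_allowed(WEB_RULES, "tcp", 443, "198.51.100.7"))  # HTTPS from anywhere
print(inbound_allowed(WEB_RULES, "tcp", 22, "198.51.100.7"))   # SSH from outside the office
```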
Common security group patterns#
Web server (public-facing):
Database server (internal only):
Notice the last example uses --SourceGroupId instead of --SourceCidrIp. This means “allow traffic from any instance in that security group.” This is the right way to do internal service communication — you never hardcode IPs.
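The two source styles can be sketched as a small command builder. It assembles `aliyun ecs AuthorizeSecurityGroup` calls without running them. The flag names are the ones I know from the ECS API, so confirm them with the CLI's `--help`, and the group IDs are placeholders.

```python
import shlex


def authorize_rule(group_id, port, description, source_cidr=None, source_group_id=None,
                   region="cn-beijing"):
    """Build an `aliyun ecs AuthorizeSecurityGroup` call for one inbound TCP rule."""
    if (source_cidr is None) == (source_group_id is None):
        raise ValueError("pass exactly one of source_cidr or source_group_id")
    args = [
        "aliyun", "ecs", "AuthorizeSecurityGroup",
        "--RegionId", region,
        "--SecurityGroupId", group_id,
        "--IpProtocol", "tcp",
        "--PortRange", f"{port}/{port}",
        "--Description", description,
    ]
    if source_cidr is not None:
        args += ["--SourceCidrIp", source_cidr]       # public-facing: match by address range
    else:
        args += ["--SourceGroupId", source_group_id]  # internal: match by security group
    return args


print(shlex.join(authorize_rule("sg-web", 443, "HTTPS from anywhere", source_cidr="0.0.0.0/0")))
print(shlex.join(authorize_rule("sg-db", 5432, "PostgreSQL from app tier", source_group_id="sg-app")))
```

Requiring a description argument in the helper is a cheap way to enforce the "describe every rule" habit.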
Rules I apply to every security group#
- Never open SSH (22) to `0.0.0.0/0`. Use your office CIDR, or better, use a bastion host.
- Never open database ports to the internet. PostgreSQL on 5432, MySQL on 3306, Redis on 6379: these should only be reachable from your app security group.
- Use descriptions on every rule. Six months from now, you will not remember why port 8443 is open. The description tells you.
- Review rules quarterly. Security groups accumulate stale rules like barnacles. Set a calendar reminder.
Key pairs and SSH access#
Passwords are bad for SSH. They are brute-forceable, they encourage password reuse, and they cannot be rotated without logging into the instance. Key pairs are the standard.
Creating and using key pairs#
We created a key pair earlier. Here is the complete SSH workflow:
With this config:
The ProxyJump directive is the modern replacement for SSH tunneling through bastion hosts. It establishes the SSH connection through bastion transparently — you do not need to SSH to the bastion first, then SSH to the internal host. One command, two hops, no exposed internal IPs.
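A bastion setup of this kind can be generated rather than hand-edited. The sketch below renders an `~/.ssh/config` fragment with one public bastion and one internal host reached through it. The host aliases, IP addresses, key path, and user are all hypothetical placeholders.

```python
def render_ssh_config(bastion_ip, app_private_ip, key_path="~/.ssh/ecs-key.pem", user="root"):
    """Render an ~/.ssh/config fragment: a public bastion and an internal host behind it."""
    return "\n".join([
        "Host bastion",
        f"    HostName {bastion_ip}",
        f"    User {user}",
        f"    IdentityFile {key_path}",
        "",
        "Host app",
        f"    HostName {app_private_ip}",
        f"    User {user}",
        f"    IdentityFile {key_path}",
        "    ProxyJump bastion",   # hop through the bastion alias defined above
        "",
    ])


# Placeholder addresses: a documentation-range public IP and a private VPC IP.
print(render_ssh_config("203.0.113.10", "10.0.1.20"))
```

With a fragment like this in place, `ssh app` connects to the internal host through the bastion in a single command.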
Key rotation#
To rotate keys without downtime:
- Generate a new key pair in the ECS console or CLI.
- Add the new public key to `~/.ssh/authorized_keys` on the target instance (or use cloud-init to manage keys via the API).
- Test login with the new key.
- Remove the old public key from `authorized_keys`.
- Delete the old key pair from ECS.
For fleets, manage SSH keys through cloud-init or an Ansible playbook — never manually edit authorized_keys on 20 instances.
Disks and storage#
Every ECS instance has at least one disk: the system disk, which holds the OS. You can attach up to 16 additional data disks. All disks are network-attached block storage — they persist independently of the instance lifecycle (if configured correctly).

Disk types#
| Type | IOPS (max) | Throughput (max) | Latency | Best for |
|---|---|---|---|---|
| ESSD PL0 | 10,000 | 180 MB/s | 0.2-0.5 ms | System disks, light workloads |
| ESSD PL1 | 50,000 | 350 MB/s | 0.1-0.3 ms | General production, databases |
| ESSD PL2 | 100,000 | 750 MB/s | 0.1-0.3 ms | High-IOPS databases |
| ESSD PL3 | 1,000,000 | 4,000 MB/s | 0.1-0.3 ms | Extreme performance, OLTP |
| Standard SSD | 25,000 | 300 MB/s | 0.5-2 ms | Legacy, non-critical |
| Ultra Disk | 5,000 | 140 MB/s | 1-3 ms | Cold storage, archives |
The performance level (PL0 through PL3) is the single most important storage decision you make. A database on PL0 will hit the 10,000 IOPS ceiling and queue operations; the same database on PL1 has 5x the headroom. The price difference between PL0 and PL1 is about 2x — still cheap compared to the compute cost.
My default: PL0 for system disks (the OS does not need high IOPS), PL1 for data disks running databases or anything with fsync. If you are not sure, start with PL1 — the cost difference for a 100 GiB disk is about 40 CNY/month.
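The choice of performance level can be reduced to a lookup against the table's IOPS ceilings. This sketch assumes you want to stay below 80% of the ceiling (an illustrative default) and ignores the fact that a disk's real ceiling also depends on its size.

```python
# Max IOPS per ESSD performance level, copied from the table above.
PL_MAX_IOPS = {"PL0": 10_000, "PL1": 50_000, "PL2": 100_000, "PL3": 1_000_000}


def recommend_pl(peak_iops: int, headroom: float = 0.8) -> str:
    """Return the lowest PL whose ceiling keeps peak_iops within the headroom fraction."""
    for level, ceiling in PL_MAX_IOPS.items():
        if peak_iops <= ceiling * headroom:
            return level
    raise ValueError(f"{peak_iops} IOPS exceeds what a single ESSD PL3 disk offers")


for observed in (3_000, 9_000, 45_000, 500_000):
    print(f"{observed:>7} peak IOPS -> {recommend_pl(observed)}")
```

Feed it the peak IOPS you measured over a representative week, not an average, since queuing starts at the peaks.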
Expanding disks online#
ECS supports online disk expansion — you can grow a disk without stopping the instance or unmounting the filesystem:
You can only expand, never shrink. Plan your initial size conservatively and grow as needed — this is cloud, not a physical server where disk replacement requires a maintenance window.
Snapshots#
Snapshots are point-in-time copies of a disk. They are incremental (only changed blocks are stored) and crash-consistent. Use them for:
- Backup before risky operations. `aliyun ecs CreateSnapshot --DiskId d-xxx` before you run that database migration.
- Creating images. Snapshot a configured instance’s system disk, then create a custom image from it. Every new instance boots fully configured in 30 seconds instead of waiting 3 minutes for cloud-init.
- Disaster recovery. Set up automatic snapshot policies — daily snapshots retained for 7 days is a reasonable starting point.
Monitoring and maintenance#
CloudMonitor metrics#

Every ECS instance automatically reports metrics to CloudMonitor. The ones you should watch:
- CPUUtilization: Sustained >80% means you need to scale up or out.
- MemoryUsedUtilization: Sustained >85% means you are approaching OOM territory. Note: this metric requires the CloudMonitor agent (installed by default on Alibaba Cloud Linux).
- DiskReadIOPS / DiskWriteIOPS: Compare against your disk’s PL limit. If you are consistently at 80% of the limit, upgrade the PL or add a disk.
- IntranetInRate / IntranetOutRate: Network throughput. If you are hitting the instance family’s limit, you need a bigger instance.
- vm.TcpConnectionCount: Connection count. Useful for detecting connection leaks.
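"Sustained" is the important word in those thresholds: a single spike should not page anyone. The sketch below shows one way to express that, treating a breach as sustained when a chosen number of consecutive samples exceed the threshold. The window of five samples is an assumption, and the metric values are made up for the example.

```python
# Percent thresholds taken from the list above.
THRESHOLDS = {"CPUUtilization": 80, "MemoryUsedUtilization": 85}


def sustained_breach(samples, threshold, min_consecutive=5):
    """True if samples contain at least min_consecutive readings in a row above threshold."""
    run = 0
    for value in samples:
        run = run + 1 if value > threshold else 0
        if run >= min_consecutive:
            return True
    return False


def check(metrics, min_consecutive=5):
    """Return the names of metrics whose samples show a sustained breach."""
    return [
        name for name, limit in THRESHOLDS.items()
        if sustained_breach(metrics.get(name, []), limit, min_consecutive)
    ]


# Made-up one-minute samples for illustration.
recent = {
    "CPUUtilization": [50, 85, 90, 95, 88, 91, 60],
    "MemoryUsedUtilization": [86, 70, 90, 60],
}
print(check(recent))
```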
Setting up alerts#
Scheduled maintenance and live migration#
Alibaba Cloud periodically maintains the physical infrastructure. When your instance’s host needs maintenance, you receive a notification (email, SMS, or console alert) with a scheduled window — typically 2-4 weeks away.
For most instance families, Alibaba Cloud performs live migration: your instance is moved to another physical host with near-zero downtime (a brief pause of 10-100ms during the final memory copy). You do not need to do anything.
For instances that cannot be live-migrated (bare metal, GPU instances with vGPU), you get a maintenance window and need to restart the instance yourself. Set up a process to check for pending maintenance events:
Solution: production-ready app server#
Let’s put everything together. Here is the complete sequence to go from nothing to a running, secured, monitored Flask application on ECS — all via CLI.
Step 1: Create the VPC and VSwitch#
Step 2: Create security group with rules#
Step 3: Create key pair#
Step 4: Find the latest image#
Step 5: Create and start the instance#
Step 6: Verify the deployment#
Step 7: Set up HTTPS with certbot#
After pointing your domain’s DNS A record to the public IP:
Step 8: Set up monitoring#
That is a complete production deployment: VPC, security group, key pair, ECS instance with cloud-init automation, nginx reverse proxy, Flask app under supervisor, HTTPS with auto-renewal, and monitoring alerts. From zero to production in about 5 minutes of API calls and 3 minutes of cloud-init execution.
Summary#
Start small, resize later. ECS supports in-place instance type changes (with a brief stop and start). Do not guess what you need; measure and adjust.
Use the right instance family. General Purpose (g-series) is the default. Switch to Compute (c-series) for CPU-bound, Memory (r-series) for data-heavy, or GPU (gn-series) for ML workloads. Try ARM (y-suffix) for 20-30% cost savings.
Mix pricing models. Subscription for steady production, PAYG for dev/test, Preemptible for batch. A hybrid strategy typically saves 40-50% over all-PAYG.
Automate from boot. Cloud-init should bring every instance to production-ready without manual SSH. If you find yourself SSHing in to install things, your cloud-init is incomplete.
Security groups are not optional. Default deny, allow only what you need, use security group references (not IPs) for internal traffic, review quarterly.
Snapshots are your safety net. Automatic daily snapshots cost almost nothing and have saved me from disaster more than once.
For an infrastructure-as-code approach to ECS, see our Terraform series, Part 4: Compute, which covers the same concepts as Terraform resources. Next up in this series, Part 3 dives deep into VPC, VSwitches, route tables, and NAT gateways: the networking layer that connects all your ECS instances.