Series · Aliyun PAI · Chapter 1

Aliyun PAI (1): Platform Overview and the Product Family Map

What Aliyun PAI actually is in 2026, the four-layer architecture from the official docs, the five sub-products you'll touch, and a sane account/workspace setup so the rest of the series can skip the boilerplate.

If your team trains or serves any model on Alibaba Cloud, sooner or later you will end up in the PAI console. PAI is the umbrella; underneath it sit the actual workhorses — a notebook product, a distributed training service, a model-serving service, plus a couple of GUI/quick-deploy layers on top. After about eighteen months of running real LLM workloads on it for an AI marketing platform, this series is the field guide I wish someone had handed me before I shipped my first endpoint.

This first article is the lay of the land. It is deliberately short on code — articles 2 to 5 are the deep dives. The goal is that when I say “DLC job” or “EAS endpoint” later, you already know which bucket they fall into.

What PAI is, and what it isn’t

Per the official docs, Platform for AI (PAI) is “Alibaba Cloud’s AI development platform covering the full lifecycle: data annotation, model development, training, and deployment”. The console at pai.console.aliyun.com is one entry point, but PAI itself is a family of related products that share an account model, an OSS-backed storage layer, and a single Python SDK.

The mental model that has worked best for me:

  • PAI is the shop.
  • DSW, DLC, EAS, Designer, Model Gallery are the workbenches inside.
  • ECS, OSS, NAS, CPFS are where the actual silicon and bytes live. PAI just orchestrates them on your behalf.

The official “Service architecture” topic spells it out as a four-layer stack:

PAI four-layer service architecture

Read that bottom-up. The infrastructure layer is the silicon — CPUs, GPUs, RDMA fabric, and ACK Kubernetes underneath. On top of that, Lingjun (灵骏) gives you very-high-density AI compute and general-purpose compute gives you everyday ECS-backed GPU pools. The platform-and-tools layer is where you spend your day: PyTorch / Megatron / DeepSpeed, plus PAI’s optimization toys (TorchAcc, BladeLLM, EasyCkpt, AIMaster), plus the visible products (DSW, DLC, EAS, Designer, FeatureStore, iTAG). The application layer is how PAI plugs into the rest of Alibaba’s MaaS world (ModelScope, Bailian/DashScope, Model Studio). The business layer is the marketing slide for industry use cases.

The reason to use PAI instead of raw ECS is that it pre-bakes the CUDA / PyTorch images, mounts your OSS bucket for you, gives you a metrics dashboard, and bills per second.

The five sub-products you actually touch

After a year and a half of production work I have only ever paid for these, drawn straight from the official “Core components” table:

| Component | Per the docs | When to reach for it |
| --- | --- | --- |
| DSW (Data Science Workshop) | Cloud-based IDE with Jupyter / VSCode / terminal, pre-configured PyTorch and TensorFlow images, GPU instances | Interactive dev, debugging, small-scale training |
| DLC (Deep Learning Containers) | Kubernetes-based training with Megatron, DeepSpeed, PyTorch, TF, Slurm, Ray, MPI, XGBoost — no cluster setup | Multi-GPU / multi-node SFT, pretraining, large eval |
| EAS (Elastic Algorithm Service) | Online inference with auto-scaling, canary release, traffic splitting, mirroring | Production inference endpoints |
| Designer | 140+ built-in algorithm components, drag-and-drop pipelines, exportable JSON, schedulable in DataWorks | ETL → train → eval flows handed off to non-coders |
| Model Gallery | Wraps DLC + EAS for zero-code deploy and fine-tune of catalogued open-source models | Evaluating a Qwen / DeepSeek / Llama model in 10 minutes |

There’s also iTAG (data annotation), PAI-Lingjun for very large clusters, PAI-Blade / BladeLLM for inference optimization, and FeatureStore, but unless you’re doing >1000-GPU pretraining or building a recommender system, you can ignore them on day one.
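As a rule of thumb, the table collapses into a small decision helper. A sketch — the mapping is my own shorthand for the table above, not an official PAI taxonomy:

```python
# Hypothetical helper: map a workload description to the PAI sub-product
# I'd reach for first. The keys are my own shorthand, not PAI concepts.
PRODUCT_FOR = {
    "interactive-dev": "DSW",
    "multi-node-training": "DLC",
    "production-inference": "EAS",
    "no-code-pipeline": "Designer",
    "quick-model-eval": "Model Gallery",
}

def pick_product(workload: str) -> str:
    try:
        return PRODUCT_FOR[workload]
    except KeyError:
        raise ValueError(f"unknown workload: {workload!r}") from None

print(pick_product("multi-node-training"))  # DLC
```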

The product split maps cleanly onto the ML lifecycle:

PAI sub-products on the ML lifecycle

Designer and Model Gallery are orthogonal — they sit on top, generating jobs that ultimately run on the same DLC / EAS substrate.

How PAI relates to ECS and OSS

This trips up everyone who comes from a pure cloud-VM background. Three rules:

  1. PAI never owns your data. Datasets, checkpoints, and model artifacts all live in OSS (or NAS for POSIX semantics, or CPFS for HPC-style throughput). When a DSW or DLC instance dies, anything you didn’t write to OSS is gone. There is a “system disk” but treat it as /tmp.
  2. PAI does own the compute. You do not provision GPU ECS instances yourself for PAI workloads. PAI manages a pool, you ask for 1 * ecs.gn7i-c8g1.2xlarge and you get billed per second of allocation.
  3. PAI shares your account but uses its own RAM roles. When you grant PAI access to OSS, you’re attaching a service-linked role (AliyunPAIAccessingOSSRole) so PAI’s compute can read your bucket without a long-lived AK pair. Do not skip this — without it your DLC jobs will fail at data_loader time with a 403.

Real-world tip: The single most common “PAI is broken” ticket is a permission issue between PAI and OSS. Before debugging your training script, run ossutil ls oss://your-bucket/ from inside a DSW terminal. If that fails, fix the role, not the code.
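If you would rather run the same sanity check from Python (say, a notebook cell), here is a minimal sketch with the `oss2` library. The bucket name is a placeholder, and I am assuming OSS's `oss-<region>-internal.aliyuncs.com` endpoint convention for intra-region traffic:

```python
import os

def oss_endpoint(region: str, internal: bool = True) -> str:
    """Build an OSS endpoint for a region. Inside PAI, the -internal
    endpoint keeps traffic off the public network."""
    suffix = "-internal" if internal else ""
    return f"https://oss-{region}{suffix}.aliyuncs.com"

def check_bucket_access(bucket_name: str, region: str) -> None:
    import oss2  # Alibaba Cloud OSS SDK: pip install oss2
    # Same AK pair the PAI SDK session uses; in DSW, role-based
    # credentials are usually injected for you instead.
    auth = oss2.Auth(
        os.environ["ALIBABA_CLOUD_ACCESS_KEY_ID"],
        os.environ["ALIBABA_CLOUD_ACCESS_KEY_SECRET"],
    )
    bucket = oss2.Bucket(auth, oss_endpoint(region), bucket_name)
    # Listing a handful of keys is enough to surface a 403 early.
    for obj in oss2.ObjectIterator(bucket, max_keys=5):
        print(obj.key)

# check_bucket_access("your-bucket", "cn-shanghai")
```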

Account, region, workspace

To get started you need three things in this order:

  1. An aliyun.com account with real-name verification (实名认证) — required for any GPU resource. International accounts work for most regions but Hangzhou, Shanghai, and Beijing have the best GPU stock.
  2. A region. Pick one and stick to it. PAI resources, OSS buckets, and ECS GPUs are all region-scoped, and cross-region traffic costs money and adds latency. For mainland production I default to cn-shanghai; for international, ap-southeast-1 (Singapore).
  3. A workspace. Per the docs, the workspace is PAI’s tenancy primitive — it holds quotas, datasets, model registries, and IAM bindings. You almost always want at least two: a dev workspace where humans poke around in DSW, and a prod workspace where DLC jobs and EAS endpoints live. Cross-workspace permissioning is fiddly, but the isolation pays for itself the first time an intern accidentally restarts a serving endpoint.
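One way to keep the dev/prod split honest in scripts is a naming convention plus a tiny resolver. A sketch — the `(id, name)` pairs stand in for whatever your workspace listing returns, and the `-dev`/`-prod` suffix convention is my own assumption, not a PAI feature:

```python
# Hypothetical resolver: given (id, name) pairs from a workspace listing,
# pick the single workspace for the current environment.
def pick_workspace(workspaces, env: str) -> str:
    matches = [wid for wid, name in workspaces if name.endswith(f"-{env}")]
    if len(matches) != 1:
        raise LookupError(f"expected exactly one *-{env} workspace, got {matches}")
    return matches[0]

ws = [("ws-123", "mlops-dev"), ("ws-456", "mlops-prod")]
print(pick_workspace(ws, "prod"))  # ws-456
```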

PAI tenancy: account, region, workspace

Two paths: console vs SDK

Like Bailian, PAI gives you two ways to do everything. The console is good for one-offs and inspecting state; the SDK is what you ship in CI.

The Python SDK is one package:

pip install alibabacloud-pai-python-sdk

A “hello PAI” — list your workspaces:

import os
from pai.session import setup_default_session

# Credentials come from environment variables so nothing lands in git.
sess = setup_default_session(
    access_key_id=os.environ["ALIBABA_CLOUD_ACCESS_KEY_ID"],
    access_key_secret=os.environ["ALIBABA_CLOUD_ACCESS_KEY_SECRET"],
    region_id="cn-shanghai",  # must match the region your OSS bucket lives in
)

for ws in sess.workspace_api.list().items:
    print(ws.id, ws.name)
If that prints at least one workspace ID, your account, region, and credentials are wired correctly and you can move on to article 2.

Real-world tip: Use a sub-account with a scoped RAM policy for SDK work. Never use the root account access key — and if your AK pair shows up in any git history, rotate immediately. Aliyun’s leaked-key detection is OK but it’s not GitHub-grade fast.

Pricing model in one paragraph

The docs list five billing methods: pay-as-you-go, subscription (monthly/yearly prepaid), resource plan (DSW prepaid quota), savings plan (commit for discount), and pay-by-inference-duration (EAS serverless — no idle replica cost). DSW bills per second while an instance is running; DLC bills per second, with a separate quota for spot/preemptible GPUs that is roughly 30-50% cheaper if your job can checkpoint; EAS bills per second of replica time plus a small per-million-requests charge, with auto-scaled minimum replicas dominating the cost. Designer and Model Gallery have no charge themselves — they spawn DLC/EAS resources that bill normally. There’s a small free tier for new accounts that’s enough to follow this whole series end-to-end.
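To make the per-second shape of the bill concrete, a back-of-the-envelope calculator — every rate here is a made-up placeholder, not a real PAI price; only the 30-50% spot discount range comes from the docs:

```python
# Illustrative only: the hourly rate is a placeholder, not a PAI price.
GPU_HOURLY = 20.0     # hypothetical pay-as-you-go rate per GPU-hour
SPOT_DISCOUNT = 0.4   # docs say spot is ~30-50% cheaper; assume 40% here

def training_cost(gpu_count: int, seconds: float, spot: bool = False) -> float:
    """Per-second billing: GPUs * time * rate, discounted for spot."""
    rate = GPU_HOURLY / 3600 * (1 - SPOT_DISCOUNT if spot else 1.0)
    return gpu_count * seconds * rate

# An 8-GPU job running for 2 hours:
print(training_cost(8, 2 * 3600))             # on-demand: 320.0
print(training_cost(8, 2 * 3600, spot=True))  # spot: 192.0
```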

What’s next

Article 2 is PAI-DSW end-to-end: picking the right GPU instance, the image catalog, OSS-FUSE mounting, and a working MNIST notebook (the one straight out of the official Quick Start). Article 3 is PAI-DLC distributed training — a real multi-GPU job with AIMaster fault tolerance. Article 4 is PAI-EAS model serving, including the cold-start trap that has bitten me more than once. Article 5 is the honest comparison of Designer vs Model Gallery for the “I just want to ship something” cases.

If you only read one, read article 4 — EAS is where most of the production money is spent and where the docs are thinnest.

Liked this piece?

Follow on GitHub for the next one — usually one a week.
