Aliyun PAI (4): PAI-EAS — Model Serving, Cold Starts, and the TPS Lie
End-to-end PAI-EAS for production: image-based deploy from OSS-mounted weights, the three inference modes, an autoscaler that doesn't blow your budget, and canary releases via service groups. Includes a working vLLM Qwen3 deployment from the official Quick Start.
EAS is where the money goes. DSW costs you a few hundred RMB a month for dev. DLC costs you in spikes. EAS bills 24/7 because someone might call your endpoint, and that “minimum replica count” line in the autoscaler config is the single highest-leverage knob in the whole platform. This article is what I wish I’d known the day before we shipped our first production endpoint.
What EAS is, per the docs
The official “EAS overview” frames it as: “deploy trained models as online inference services or AI web applications, with heterogeneous resources, automatic scaling, one-click stress testing, canary releases, and real-time monitoring”. The two things to underline:
- It’s a container-runtime serving layer — your model lives in OSS, your code lives in a container image, EAS pulls the image, mounts OSS at startup, runs your start command, and listens on a port.
- It’s autoscaled by replica count — not a serverless function model (with one important exception, see below). Replicas are real GPU pods that take 30-120s to come up. Plan for that.
The request path

The four moving parts the docs call out for runtime-image deployment:
- Runtime image — a read-only template with the OS, CUDA, Python, and dependencies. Use an official one (`vllm:0.11.2-mows0.5.1`, `pytorch:...`) or push your own to ACR.
- Code and model — not in the image; they live in OSS / NAS. Decoupling them lets you update weights without rebuilding the image.
- Storage mounting — at startup, EAS FUSE-mounts the OSS path you specified to a directory inside the container, e.g. `/mnt/data/`.
- Run command — the first command after the container starts. Typically launches your HTTP server (`vllm serve /mnt/data/Qwen/Qwen3-0.6B`).
Real-world tip: Bake `/mnt/data/` into your code paths from day one. Do not let model paths get hardcoded to `/workspace/models/`. Switching from local dev to EAS then becomes a one-line config change instead of a code refactor.
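The tip above, as a minimal pattern (the `MODEL_DIR` environment variable name is my own convention, not an EAS one):

```python
import os

# Resolve the model directory from the environment, falling back to the EAS
# mount path. Locally: export MODEL_DIR=./models — on EAS the default applies
# unchanged, so nothing in the code needs to know where it is running.
MODEL_DIR = os.environ.get("MODEL_DIR", "/mnt/data")
MODEL_PATH = os.path.join(MODEL_DIR, "Qwen/Qwen3-0.6B")
```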
Three inference modes
The docs list three. Pick deliberately — the wrong mode wastes either money or latency.

A practical heuristic:
- Real-time sync — chatbots, RAG retrieval, ad ranking, search. You care about p99 latency.
- Async — anything that takes 5+ seconds: image-gen, video-gen, OCR-on-PDF batches. The built-in queue scales replicas based on backlog, which is the right mental model for these workloads.
- Batch — anything you can wait minutes for: nightly embeddings, voice transcription. Use preemptible instances and cut the bill in half.
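The heuristic above can be codified as a small decision helper (the thresholds mirror the text — 5+ seconds for async, minutes-tolerant for batch — but are mine, not an EAS API):

```python
def pick_mode(p99_latency_s: float, deadline_s: float) -> str:
    """Pick an EAS inference mode from timing budgets.

    p99_latency_s: how long one request takes at p99.
    deadline_s: how long the caller is willing to wait for a result.
    """
    if deadline_s >= 60:       # caller tolerates minutes -> batch on preemptible instances
        return "batch"
    if p99_latency_s >= 5:     # long-running requests -> queue-backed async service
        return "async"
    return "real-time"         # latency-sensitive -> synchronous serving, watch p99
```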
The Quick Start, in real config
The official Quick Start deploys Qwen3-0.6B with vLLM. The console flow is:
- Method: Image-based deployment.
- Image: `vllm:0.11.2-mows0.5.1` (official EAS image — vLLM ≥ 0.8.5 is required for OpenAI-compatible chat).
- Model: OSS, `oss://your-bucket/models/`, mount path `/mnt/data/`.
- Command: `vllm serve /mnt/data/Qwen/Qwen3-0___6B`.
- Resource: `ecs.gn7i-c16g1.4xlarge` (1 × A10).
- Click Deploy. ~5 minutes to Running.
You then get an OpenAI-compatible endpoint at the URL the console gives you.
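Calling it — a minimal sketch using only the standard library, assuming the token-authenticated OpenAI-compatible endpoint the console shows (the URL, token, and model name below are placeholders):

```python
import json
import urllib.request

# Placeholders — substitute the endpoint URL and token from the EAS console.
ENDPOINT = "https://YOUR-SERVICE.REGION.pai-eas.aliyuncs.com/api/predict/qwen3_demo"
TOKEN = "YOUR-EAS-TOKEN"

def build_chat_request(prompt: str) -> urllib.request.Request:
    """Build an OpenAI-compatible chat completion request for the endpoint."""
    payload = {
        "model": "Qwen3-0.6B",
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{ENDPOINT}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": TOKEN,  # EAS expects the service token here
            "Content-Type": "application/json",
        },
        method="POST",
    )

if __name__ == "__main__":
    req = build_chat_request("Say hello in one sentence.")
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp)["choices"][0]["message"]["content"])
```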
If that returns a sentence, your endpoint is alive and you can go back to your colleagues looking like a wizard.
Auto-scaling done right
This is the part the docs do not really hammer home. Default autoscaler behaviour (scale on request rate, min replicas = 1) is a recipe for either cold-start latency tickets or surprise bills.

The three settings that actually matter:
- `min_replicas` — never set to zero in production. A cold start on a 7B vLLM container is 60-120 seconds; the user gives up at 5. I default to 2 — one to serve, one for redundancy. For asynchronous services you can go to 0 and rely on the queue.
- `max_replicas` — the budget brake. Calculate it as (peak QPS ÷ per-replica QPS at saturation) × 2. If you don't know your per-replica QPS, run the one-click stress test. The docs cover this under "Service stress testing".
- Scaling metric — by default it's `qps`. For LLM serving, switch to `concurrent_requests` (or vLLM's `running` metric). QPS is misleading because long generations don't show up as more requests.
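The sizing arithmetic above, as a sketch you can run against your stress-test numbers (the function and the 2× headroom factor are my convention, not an EAS API):

```python
import math

def replica_bounds(peak_qps: float, per_replica_qps: float,
                   headroom: float = 2.0) -> tuple[int, int]:
    """Derive autoscaler bounds from stress-test numbers.

    peak_qps: your busiest observed traffic.
    per_replica_qps: the saturation point from the one-click stress test.
    headroom: the 2x safety factor suggested above.
    """
    min_replicas = 2  # never 0 for sync services: cold starts run 60-120s
    max_replicas = max(min_replicas,
                       math.ceil(peak_qps / per_replica_qps * headroom))
    return min_replicas, max_replicas
```

For example, 20 QPS at peak with replicas saturating at 4 QPS each gives bounds of (2, 10).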
Real-world tip: The single biggest wasted spend I have ever seen on PAI was a `max_replicas=50` autoscaler with `min_replicas=10` on a service that got 0.5 QPS off-peak. Ten idle A10s, 24/7, for two months. Always look at the Saturday-night dashboard before you go on holiday.
Canary, blue/green, and traffic mirroring
EAS does this with service groups: a routing front-end that points at multiple service versions and splits traffic by percentage. The same primitive supports traffic mirroring — a copy of real traffic gets sent to a candidate version, but the response is discarded so users see no impact. This is the safest possible way to test a new model on production traffic.

I use a 90/10 split for the first 24 hours of any model swap, then 50/50, then 0/100. If any of those steps shows degradation in the success-rate or p99 metrics, rollback is immediate — service groups change traffic weights in seconds.
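That rollout schedule can be expressed as a small gate function (the SLO thresholds and function shape are illustrative, not an EAS API — the actual weight change happens through the service group):

```python
def next_weight(current: int, success_rate: float, p99_ms: float,
                slo_success: float = 0.999, slo_p99_ms: float = 800) -> int:
    """Advance the canary's traffic share along 10 -> 50 -> 100,
    or roll back to 0 the moment success rate or p99 degrades."""
    if success_rate < slo_success or p99_ms > slo_p99_ms:
        return 0                  # immediate rollback: all traffic off the canary
    schedule = [10, 50, 100]      # canary share at each step of the rollout
    for step in schedule:
        if current < step:
            return step
    return current                # already at 100: nothing left to do
```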
Stress testing — actually do this
The docs have a whole section on the one-click stress tester. Use it. It auto-ramps QPS, charts replica scale-out, and tells you the per-replica saturation point. That number is what you build your autoscaler around. Going to prod without one is the most common cause of “the model fell over at the 3pm peak” tickets.
The 180-day gotcha
Buried in the docs: “If an EAS service remains in a non-Running state for 180 consecutive days, the system automatically deletes the service.” Set a calendar reminder. I lost a service config once because the team that owned it dissolved and no one paid the bill. Restoring took an afternoon of re-bisecting which vllm version was on which weights.
What’s next
Article 5 closes the series with the honest pitch for Designer and Model Gallery — the two zero/low-code surfaces. They are not what most engineers reach for, but they earn their keep when used right, and there is a specific set of jobs where they are obviously the correct answer.