Series · Aliyun PAI · Chapter 3

Aliyun PAI (3): PAI-DLC — Distributed Training Without the Cluster Pain

Submit a real multi-GPU training job on PAI-DLC, understand the resource pools (Lingjun vs general vs preemptible), and use AIMaster + EasyCKPT so a flaky node doesn't cost you a day.

A DSW notebook is for one engineer on one GPU. The moment you need eight GPUs across two nodes, or the moment training runs longer than the eight hours you’ll keep the tab open, you switch to DLC. DLC is PAI’s job-submission front-end for a managed Kubernetes cluster: you describe what you want (image, command, resources, data mounts), DLC schedules pods, runs them to completion, persists logs, and tells you what happened. The docs call this Deep Learning Containers; we just say “DLC job”.

What the docs actually claim

The official DLC overview lists four bullets I want to highlight, because they matter:

  • Diverse compute — Lingjun AI computing service, ECS, ECI, Shenlong bare metal, Lingjun bare metal. Hybrid scheduling.
  • Multiple distributed job types — pre-built support for Megatron, DeepSpeed, PyTorch DDP, TensorFlow PS/Worker, Slurm, Ray, MPI, XGBoost. No need to build your own cluster.
  • Fault tolerance — AIMaster (the watchdog), EasyCKPT (the async checkpointer), SanityCheck (pre-flight node health), node self-healing.
  • Training acceleration — built-in framework with data parallelism, pipeline parallelism, operator splitting, automatic parallel-strategy exploration, topology-aware scheduling, optimized communication.

The first and third bullets are what make DLC interesting compared to renting GPU ECS yourself.

The job lifecycle

A DLC job goes through six phases between submit and done:

[Figure: DLC job lifecycle]

Two of those phases — scheduler places pods and mount OSS / NAS — are where almost all of the “my job is stuck in PENDING” tickets get filed. Stuck on scheduling means your resource group is full; stuck on mounting means your storage RAM role is wrong. As with DSW, the diagnostic move is to spin up a tiny DSW instance with the same OSS mount and confirm you can list the mounted path.
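That listing check is easy to script from inside the DSW terminal — a minimal sketch, assuming the bucket is mounted at `/mnt/data` (the path is an assumption; use whatever mount path your job declares):

```python
import os

def check_mount(path: str, min_entries: int = 1) -> bool:
    """Return True if the mount point exists and is listable.

    A healthy OSS/NAS mount shows up as a readable directory; a broken
    RAM role typically yields a missing, empty, or unreadable path.
    """
    if not os.path.isdir(path):
        print(f"{path}: not a directory -- mount likely failed")
        return False
    try:
        entries = os.listdir(path)
    except PermissionError:
        print(f"{path}: permission denied -- check the RAM role")
        return False
    print(f"{path}: OK, {len(entries)} entries visible")
    return len(entries) >= min_entries

# check_mount("/mnt/data")   # run inside the DSW instance
```

If this fails in DSW, it will fail in DLC too — fix the RAM role before resubmitting the job.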

Picking a resource pool

You submit to one of three pools. The docs mostly talk about quotas and bills; the practical decision is about how interruption-tolerant your job is.

[Figure: DLC resource pools]

For most teams the answer is general-purpose, pay-as-you-go. Lingjun makes sense once you’re training on more than 8 GPUs and need RDMA between nodes — the docs note that RDMA is configurable on Lingjun and delivers “accelerated inter-node communication” (a polite way of saying NCCL AllReduce will be 5-10x faster than over standard Ethernet). Preemptible is a cost saver for jobs that checkpoint cleanly — which, thanks to EasyCKPT, is most jobs.

A real distributed job

Here is the topology you build with a four-node, 2-GPU-per-node DLC job:

[Figure: DLC distributed training topology]
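Concretely, torchrun on each node spawns one process per GPU, and a process’s global rank is derived from the node rank DLC hands it — illustrative arithmetic for the topology above, not a DLC API:

```python
def global_rank(node_rank: int, local_rank: int, nproc_per_node: int = 2) -> int:
    """Global rank of one worker process in the flat DDP world."""
    return node_rank * nproc_per_node + local_rank

NNODES, NPROC = 4, 2           # the 4-node, 2-GPU topology above
world_size = NNODES * NPROC    # 8 ranks in total

# Every (node, gpu) pair maps to exactly one rank; rank 0 on node 0
# is the one that binds MASTER_ADDR:MASTER_PORT for rendezvous.
ranks = [global_rank(n, l, NPROC) for n in range(NNODES) for l in range(NPROC)]
```

Keeping this mapping in your head makes torchrun’s flags (`--nnodes`, `--node_rank`, `--nproc_per_node`) read naturally in the submission below.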

A minimal PyTorchJob-style submission via the Python SDK (sketched — check your SDK version for the exact class names), scaled out from the MNIST notebook:

from pai.job import TrainingJob

job = TrainingJob(
    name="mnist-ddp",
    image_uri="dsw-registry-vpc.cn-shanghai.cr.aliyuncs.com/pai/"
              "modelscope:1.28.0-pytorch2.3.1tensorflow2.16.1-gpu-py311-cu121-ubuntu22.04",
    command=(
        "torchrun --nproc_per_node=2 --nnodes=$WORLD_SIZE "
        "--node_rank=$RANK --master_addr=$MASTER_ADDR "
        "--master_port=$MASTER_PORT /mnt/data/code/train_ddp.py"
    ),
    job_type="PyTorchJob",
    instance_count=4,                # 4 worker nodes
    instance_type="ecs.gn7i-c16g1.4xlarge",  # 2 x A10 each
    datasets={"train": "oss://your-bucket/datasets/mnist"},
    code_uri="oss://your-bucket/code/mnist-ddp.zip",
    output_uri="oss://your-bucket/runs/mnist-ddp/",
    fault_tolerance=True,            # turns on AIMaster
    enable_easyckpt=True,            # async checkpoint
)
job.submit(wait=False)
print(job.id, job.status)

A few things worth noting that are not obvious from a quick read of the docs:

  • $WORLD_SIZE, $RANK, $MASTER_ADDR, $MASTER_PORT are injected by DLC. You do not have to discover peers — DLC handles peer discovery and writes those env vars before your container starts. (See “Built-in environment variables” in the User Guide.)
  • fault_tolerance=True spins up an AIMaster sidecar that watches every worker. If a worker pod dies, AIMaster marks it, requests a replacement, and the surviving workers wait for it instead of crashing the whole job. This is the single most important toggle for jobs longer than a few hours.
  • enable_easyckpt=True swaps torch.save for an async path that writes to OSS without blocking the training step. On a 70B model this turns checkpointing from a 3-minute stall into about 10 seconds of overlap.
  • Image URL is region-specific. The dsw-registry-vpc.cn-shanghai.cr.aliyuncs.com prefix only works inside the Shanghai VPC; use the matching one for your region or pulls will time out.
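Inside train_ddp.py you can rely on those variables being present. A small sketch of reading them defensively, with single-process defaults so the script still runs outside DLC (the defaults are my own convention, not something DLC injects):

```python
import os
from dataclasses import dataclass

@dataclass
class DistEnv:
    world_size: int   # number of worker nodes, injected by DLC
    rank: int         # this node's index among the workers
    master_addr: str  # rendezvous host for torchrun
    master_port: int  # rendezvous port

def read_dist_env() -> DistEnv:
    """Read DLC-injected variables, defaulting to a single local node."""
    return DistEnv(
        world_size=int(os.environ.get("WORLD_SIZE", "1")),
        rank=int(os.environ.get("RANK", "0")),
        master_addr=os.environ.get("MASTER_ADDR", "127.0.0.1"),
        master_port=int(os.environ.get("MASTER_PORT", "29500")),
    )
```

The same script then works both in a DSW smoke test (one node, defaults) and under DLC (injected values), which is the cheapest way to debug distributed init.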

Watching it run

The console “Training Jobs” view gives you logs, GPU utilization, network throughput, and AIMaster events. From the SDK you can stream logs:

for line in job.tail_logs(follow=True):
    print(line)

For longer jobs I forward logs to SLS (Log Service) and set CloudMonitor alerts on gpu_util < 0.3 for 15 minutes — that is the canonical signal that something is wedged on data loading or distributed init.
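The alert condition itself is simple enough to sanity-check offline — a sketch of the “gpu_util below 0.3 for 15 minutes” rule over chronological (timestamp_seconds, gpu_util) samples (the thresholds mirror the alert above; the function is mine, not a CloudMonitor API):

```python
def is_wedged(samples, threshold=0.3, window_s=15 * 60):
    """True if gpu_util stayed below `threshold` for at least `window_s`.

    `samples` is a chronological list of (timestamp_seconds, gpu_util).
    Any sample at or above the threshold resets the low-utilization window.
    """
    low_since = None
    for ts, util in samples:
        if util < threshold:
            if low_since is None:
                low_since = ts          # window opens here
            if ts - low_since >= window_s:
                return True             # wedged: data loading or dist init
        else:
            low_since = None            # recovered, reset the window
    return False
```

Requiring the *sustained* window matters: a brief utilization dip at an epoch boundary should not page anyone.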

Common failures and what they actually mean

| Symptom | Real cause |
| --- | --- |
| Job stuck in Pending for >5 min | Resource group full, or your quota is exhausted. Switch pool or reduce instance_count. |
| cannot mount oss://... at startup | RAM role missing the AliyunPAIAccessingOSSRole attachment. Re-attach in workspace settings. |
| NCCL hangs at start of step 1 | RDMA misconfig on Lingjun, or a flaky node. Enable SanityCheck to isolate before the run. |
| Loss explodes on resume from checkpoint | EasyCKPT saved optimizer state but you did not load it. Read the EasyCKPT load helper, not torch.load. |
| Job finishes but output_uri is empty | Your training script wrote to /root instead of the mounted OSS path. Recheck OUTPUT_DIR. |
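The “loss explodes on resume” row deserves one concrete illustration: a checkpoint has to round-trip optimizer state (Adam moments, step counts) along with the weights, and resume has to load both. A framework-agnostic sketch of that invariant — plain dicts standing in for torch state_dicts; EasyCKPT’s own load helper enforces this for you:

```python
REQUIRED_KEYS = {"step", "model_state", "optimizer_state"}

def make_checkpoint(step, model_state, optimizer_state):
    """Bundle everything resume needs -- weights alone are not enough."""
    return {"step": step, "model_state": model_state,
            "optimizer_state": optimizer_state}

def restore(ckpt):
    """Fail loudly if the checkpoint is missing any piece of state.

    Resuming Adam with fresh (zeroed) moments is exactly how a loss
    curve blows up a few hundred steps after an apparently clean resume.
    """
    missing = REQUIRED_KEYS - ckpt.keys()
    if missing:
        raise KeyError(f"incomplete checkpoint, missing: {sorted(missing)}")
    return ckpt["step"], ckpt["model_state"], ckpt["optimizer_state"]
```

Failing fast at load time is far cheaper than diagnosing a diverging loss curve six hours into the resumed run.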

Cost reality

For a typical 7B SFT (4 x A10, 6 hours) on general-purpose pay-as-you-go you’re looking at roughly the cost of an OK dinner in Shanghai. A 70B QLoRA (8 x A100 80 GB, 12 hours) is closer to a long weekend in Hangzhou. Preemptible cuts that by 30-50% if your job can survive being killed every few hours — with EasyCKPT it can.

What’s next

Article 4 is EAS — taking whatever you trained and putting it behind an HTTP endpoint that auto-scales, mirrors traffic, and does not fall over at 3am. EAS is where most of your monthly Aliyun bill will live; it is worth getting right.

Liked this piece?

Follow on GitHub for the next one — usually one a week.
