Series · Aliyun PAI · Chapter 2

Aliyun PAI (2): PAI-DSW — Notebooks That Don't Eat Your Weights

Working with PAI-DSW for real: choosing the right GPU image, mounting OSS so you don't lose checkpoints when the instance restarts, and an MNIST notebook drawn from the official Quick Start that you can copy-paste.

Every time I onboard a new ML engineer to PAI the first day looks the same. They start a DSW instance, pip install their world, train for an hour, restart the kernel for some reason, and then ask me where their model file went. The honest answer — “in /root on a node that no longer exists” — is the kind of lesson you only need to learn once. This article is the version of that lesson you read in advance.

What DSW actually is

Per the official “DSW Overview”, DSW is a cloud-based IDE for AI development that integrates JupyterLab, VSCode, and a terminal, with pre-configured container images for PyTorch and TensorFlow, heterogeneous compute (CPU / GPU / Lingjun), and the ability to mount datasets from OSS, NAS, and CPFS. In practice that means you click “Open” and within a minute you have a real Jupyter on a real GPU with nvidia-smi working and PyTorch already importable.

What’s interesting is what’s not in the box. The DSW container has a system disk that lives only as long as the instance does. Anything you pip install survives a kernel restart but does not survive an instance restart unless you persist the conda env to OSS or save it to ACR via the snapshot feature.

Anatomy of a DSW instance

Picking an instance type

Per the docs, DSW resource types come in two flavors: public resources (pay-as-you-go) and dedicated resources (subscription on general-purpose compute or Lingjun). For day-to-day work, public is the right answer — you’re paying for the GPU minutes you actually use, and the per-second metering means you can spin one up for a 10-minute experiment and not care.

What I actually pick:

  • Tiny experiment / debugging: ecs.gn7i-c8g1.2xlarge (1 × A10, 24 GB). Cheap, plenty for fine-tuning a 7B in 4-bit quant or running diffusion at 512×512.
  • Real training of a small model: ecs.gn7i-c16g1.4xlarge or ecs.gn7e-c12g1.3xlarge (A10 / A100 40 GB). Comfortable for a CIFAR-10 ResNet, ImageNet-tiny, or a 7B SFT with QLoRA.
  • LLM dev: ecs.gn7e-c12g1.6xlarge or higher (A100 80 GB). Required if you want to load a 13-30B in BF16 without offloading.

Real-world tip: If the GPU type you want is “out of stock” in the console, switch the AZ. Stock is per-AZ, not per-region. I have seen 80 GB A100 unavailable in cn-shanghai-h and free in cn-shanghai-l in the same minute.

The image catalog

DSW images are official, versioned, and tagged. The Quick Start uses modelscope:1.26.0-pytorch2.6.0-gpu-py311-cu124-ubuntu22.04 — that string tells you exactly what is inside. Read it left to right: ModelScope SDK 1.26, PyTorch 2.6, GPU build, Python 3.11, CUDA 12.4, Ubuntu 22.04.
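The convention is regular enough to unpack mechanically. A small illustrative helper — `split_image_tag` is my own name, not a PAI API — that assumes the dash-separated pattern shown above:

```python
# Hypothetical helper: the tag format is a naming convention, not a documented API.
def split_image_tag(image: str) -> dict:
    """Unpack a DSW image tag like
    modelscope:1.26.0-pytorch2.6.0-gpu-py311-cu124-ubuntu22.04."""
    name, tag = image.split(":", 1)
    sdk_version, *parts = tag.split("-")
    info = {"sdk": name, "sdk_version": sdk_version}
    for p in parts:
        if p.startswith("pytorch"):
            info["pytorch"] = p[len("pytorch"):]       # framework build
        elif p.startswith("py") and p[2:].isdigit():
            info["python"] = f"{p[2]}.{p[3:]}"         # py311 -> 3.11
        elif p.startswith("cu"):
            info["cuda"] = f"{p[2:-1]}.{p[-1]}"        # cu124 -> 12.4
        elif p in ("gpu", "cpu"):
            info["device"] = p
        else:
            info["os"] = p                             # e.g. ubuntu22.04
    return info
```

Useful when a teammate asks "which CUDA is on that instance" and the answer is already in the tag.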

I almost always pick a pytorch or modelscope image. The TensorFlow images are fine but lag a major release behind. There is also a dsw-stable family that lags by design — pick it for production-adjacent work where you do not want a CUDA bump in the middle of a training run.

You can also bake your own image and push it to ACR. I do this for projects with a heavy dependency tree (vllm, flash-attn, custom CUDA kernels) — saves four minutes of pip install every time someone starts a fresh instance.

A standard workflow that does not lose data

The console flow looks like this:

Standard DSW workflow

The lifecycle hooks are easy to ignore and expensive to forget. Idle shutdown at 30 minutes is my default; scheduled shutdown at 11pm catches the case where I leave a notebook running over the weekend. An idle GPU at 5 RMB an hour, forgotten on Friday evening, is roughly 300 RMB you owe Aliyun by Monday morning.
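A quick back-of-the-envelope for the weekend case (5 RMB/hour is illustrative pay-as-you-go pricing, not a quote):

```python
# Forgotten idle GPU, Friday 8pm to Monday 8am.
RATE_RMB_PER_HOUR = 5
idle_hours = 24 * 2 + 12          # two full days plus Sunday night
cost = RATE_RMB_PER_HOUR * idle_hours
print(f"{idle_hours} idle hours -> {cost} RMB")  # 60 idle hours -> 300 RMB
```

The 30-minute idle shutdown turns that into about 2.5 RMB.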

Where your data lives

The single most important diagram in the entire DSW docs:

DSW storage layout

The mount path I use everywhere:

/mnt/data/
├── datasets/      # read-only OSS mount (the bucket lives forever)
├── checkpoints/   # writeable OSS prefix (save every N steps)
└── code/          # git repo, also on OSS so a new instance is one mount away

Mounting OSS is configured at instance create time; the docs call it “Configure storage”. Pick the bucket and the prefix, choose mount path /mnt/data/, accept the default access mode (FUSE-backed). After launch, ossutil ls oss://your-bucket/ should work from the terminal — that is your “PAI ↔ OSS RAM role” health check.
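Before a long run I also verify from Python that the mount is actually there and writable — a cheap guard against training for an hour and then discovering checkpoints went to a local directory that merely shares the name. A small sketch (paths match the layout above; adjust to yours):

```python
import os
import tempfile

def assert_writable_mount(path: str) -> None:
    """Fail fast if `path` is missing or not writable -- e.g. the OSS mount
    did not come up and the directory is just a phantom on the system disk."""
    if not os.path.isdir(path):
        raise RuntimeError(f"{path} does not exist -- is the OSS mount configured?")
    try:
        # Creating and deleting a temp file proves write access end to end.
        with tempfile.NamedTemporaryFile(dir=path):
            pass
    except OSError as e:
        raise RuntimeError(f"{path} is not writable: {e}") from e
```

I call assert_writable_mount("/mnt/data/checkpoints") in the first cell of every training notebook.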

A working MNIST notebook (straight from the Quick Start)

The official Quick Start uses MNIST handwritten digit recognition. Here is the minimum viable training cell, simplified for the article — the docs link to a full mnist.ipynb you can upload as-is:

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

device = "cuda" if torch.cuda.is_available() else "cpu"
print("Device:", device)

tx = transforms.Compose([transforms.ToTensor(),
                          transforms.Normalize((0.1307,), (0.3081,))])
train = datasets.MNIST("/mnt/data/datasets", train=True,  download=True, transform=tx)
val   = datasets.MNIST("/mnt/data/datasets", train=False, download=True, transform=tx)

train_loader = DataLoader(train, batch_size=128, shuffle=True,  num_workers=2)
val_loader   = DataLoader(val,   batch_size=512, shuffle=False, num_workers=2)

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.c1, self.c2 = nn.Conv2d(1, 32, 3, 1), nn.Conv2d(32, 64, 3, 1)
        self.fc1, self.fc2 = nn.Linear(9216, 128), nn.Linear(128, 10)
    def forward(self, x):
        x = F.relu(self.c1(x)); x = F.max_pool2d(F.relu(self.c2(x)), 2)
        x = torch.flatten(x, 1)
        x = F.relu(self.fc1(x))
        return self.fc2(x)

model = Net().to(device)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

for epoch in range(3):
    model.train()
    for xb, yb in train_loader:
        xb, yb = xb.to(device), yb.to(device)
        opt.zero_grad(); F.cross_entropy(model(xb), yb).backward(); opt.step()
    # checkpoint to OSS, not /root
    torch.save(model.state_dict(), f"/mnt/data/checkpoints/mnist_e{epoch}.pt")
    print(f"epoch {epoch} done")

The Quick Start expects roughly 98% validation accuracy after 3 epochs on a single A10. If you see anything dramatically lower, you’ve probably mounted OSS wrong and are reading from the wrong directory — not a model bug.
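The training cell above never actually computes that number. A minimal eval sketch to pair with it — it assumes the `model`, `val_loader`, and `device` defined earlier:

```python
import torch

@torch.no_grad()
def evaluate(model, loader, device):
    """Fraction of correctly classified examples over `loader`."""
    model.eval()
    correct = total = 0
    for xb, yb in loader:
        xb, yb = xb.to(device), yb.to(device)
        pred = model(xb).argmax(dim=1)    # predicted class per example
        correct += (pred == yb).sum().item()
        total += yb.numel()
    return correct / total

# In the notebook, after training:
# print(f"val acc: {evaluate(model, val_loader, device):.4f}")
```

Run it after each epoch if you want the per-epoch curve rather than a single end number.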

TensorBoard inline

DSW ships TensorBoard as a built-in extension; the docs walk through enabling it from the menu. I usually just run it as a cell:

%load_ext tensorboard
%tensorboard --logdir /mnt/data/checkpoints/runs --port 6006

The link the docs tell you to click is http://localhost:6006/ — DSW proxies the port so it works in your browser through the DSW URL. If the port is “in use”, another notebook in the same instance is holding it; restart the kernel of the offender, not the instance.

Saving the env between sessions

DSW has two mechanisms here, both worth knowing:

  1. Instance image snapshot — bakes your current container state (installed packages, system files) to ACR. Next instance you start, pick that image and you are back where you left off. Slow (a few minutes) but exact.
  2. Conda env on OSS — install all your pip deps under /mnt/data/envs/myenv/ and activate it. Survives instance death without rebaking. Faster but does not capture system-level changes (apt install etc).

I default to the conda-on-OSS approach for project work and the snapshot mechanism for “frozen demo I want to show in 6 months”.
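The mechanics of option 2 boil down to installing onto the mount and putting it on the import path. A sketch using pip's --target flag — function names are mine, paths follow the layout from earlier; a conda env created with `conda create -p /mnt/data/envs/myenv` amounts to the same thing:

```python
import subprocess
import sys
from pathlib import Path

def install_to_mount(env_dir: str, packages: list[str]) -> None:
    """pip-install packages into a directory on the OSS mount,
    e.g. /mnt/data/envs/myenv, so they survive instance death."""
    subprocess.run(
        [sys.executable, "-m", "pip", "install", "--target", env_dir, *packages],
        check=True,
    )

def activate_mount_env(env_dir: str) -> None:
    """Put the mounted env first on sys.path; call at the top of every notebook."""
    p = str(Path(env_dir))
    if p not in sys.path:
        sys.path.insert(0, p)
```

First session: install_to_mount("/mnt/data/envs/myenv", ["einops"]). Every session after: just activate_mount_env("/mnt/data/envs/myenv") and import as usual.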

What’s next

Article 3 takes the same MNIST job and shows what changes when you scale it across multiple GPUs and multiple nodes via DLC — including the AIMaster fault tolerance that the docs mention but do not really explain.

Liked this piece?

Follow on GitHub for the next one — usually one a week.
