<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Distributed Training on Chen Kai Blog</title><link>https://www.chenk.top/en/tags/distributed-training/</link><description>Recent content in Distributed Training on Chen Kai Blog</description><generator>Hugo</generator><language>en</language><lastBuildDate>Sat, 07 Mar 2026 09:00:00 +0000</lastBuildDate><atom:link href="https://www.chenk.top/en/tags/distributed-training/index.xml" rel="self" type="application/rss+xml"/><item><title>Aliyun PAI (3): PAI-DLC — Distributed Training Without the Cluster Pain</title><link>https://www.chenk.top/en/aliyun-pai/03-pai-dlc-distributed-training/</link><pubDate>Sat, 07 Mar 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/aliyun-pai/03-pai-dlc-distributed-training/</guid><description>&lt;p>A DSW notebook is for one engineer on one GPU. When you need eight GPUs across two nodes or training that runs longer than eight hours, you switch to &lt;strong>DLC&lt;/strong>. DLC is PAI&amp;rsquo;s job-submission front-end for a managed Kubernetes cluster. You describe what you want (image, command, resources, data mounts), and DLC schedules pods, runs them to completion, persists logs, and reports the results. The docs call this &lt;em>Deep Learning Containers&lt;/em>; we just say &amp;ldquo;DLC job&amp;rdquo;.&lt;/p></description></item></channel></rss>