LLM Engineering (4): Post-training — SFT, DPO, RLHF, RLAIF

Mon, 30 Mar 2026 09:00:00 +0000

A base model from pretraining can complete text, but it cannot follow instructions, refuse harmful requests, or maintain a persona; those are post-training behaviors. Post-training is also where the gap between a research paper’s claims and a production-grade model lies. This chapter covers what each post-training algorithm optimizes, why most reward models are subtly flawed, and which methods are effective in 2026.

Aliyun PAI (3): PAI-DLC — Distributed Training Without the Cluster Pain

Sat, 07 Mar 2026 09:00:00 +0000

A DSW notebook is for one engineer on one GPU. When you need eight GPUs across two nodes or training that runs longer than eight hours, you switch to DLC. DLC is PAI’s job-submission front-end for a managed Kubernetes cluster. You describe what you want (image, command, resources, data mounts), and DLC schedules pods, runs them to completion, persists logs, and reports the results. The docs call this Deep Learning Containers; we just say “DLC job”.

SFT on Chen Kai Blog
