LLM Engineering (4): Post-training — SFT, DPO, RLHF, RLAIF

Mon, 30 Mar 2026 09:00:00 +0000

A base model fresh from pretraining can complete text, but it cannot reliably follow instructions, refuse harmful requests, or maintain a persona; those are post-training behaviors. Post-training is also where the gap between a research paper’s claims and a production-grade model opens up. This chapter covers what each post-training algorithm actually optimizes, why most reward models are subtly flawed, and which methods are effective in 2026.

Reinforcement Learning (12): RLHF and LLM Applications

Thu, 25 Sep 2025 09:00:00 +0000

GPT-3 (June 2020) and ChatGPT (November 2022) are built on closely related base models. The base model could write fluent prose, complete code, and continue any pattern you gave it. Yet when asked a simple question, it might ramble, refuse for the wrong reasons, hallucinate citations, or produce toxic content. The roughly two and a half years between GPT-3 and ChatGPT were not spent mainly on building larger transformers. They were spent on making the model useful, which is a reinforcement-learning problem.

DPO on Chen Kai Blog
