RLHF on Chen Kai Blog

LLM Engineering (4): Post-training — SFT, DPO, RLHF, RLAIF

Mon, 30 Mar 2026 09:00:00 +0000

A base model from pretraining can complete text, but it cannot follow instructions, refuse harmful requests, or maintain a persona; those behaviors come from post-training. Post-training is also where the gap between a research paper’s claims and a production-grade model lies. This chapter covers what each post-training algorithm optimizes, why most reward models are subtly flawed, and which methods are effective in 2026.

Reinforcement Learning (12): RLHF and LLM Applications

Thu, 25 Sep 2025 09:00:00 +0000

GPT-3 (June 2020) and ChatGPT (November 2022) are built on essentially the same Transformer architecture. The base model could write fluent prose, complete code, and continue any pattern you gave it. Yet, when asked a simple question, it might ramble, refuse for the wrong reasons, hallucinate citations, or produce toxic content. The two and a half years between GPT-3 and ChatGPT weren’t spent on larger transformers. They were spent on making the model useful, which is a reinforcement-learning problem.

Reinforcement Learning (6): PPO and TRPO — Trust Region Policy Optimization

Tue, 26 Aug 2025 09:00:00 +0000

Policy gradients (Part 3) optimise the policy directly, sidestepping discrete argmax operators and naturally handling stochastic strategies. They have one fatal flaw: a single overly large step can destroy the policy, and because the data distribution is coupled to the policy, recovery is nearly impossible.

Trust-region methods address this flaw directly: they bound the change in behaviour, not in parameters, at every update. TRPO does this with a hard KL constraint and a second-order solver. PPO mimics the same effect with one line of clipped arithmetic. The simpler trick won: PPO trained OpenAI Five and ChatGPT’s RLHF stage, is used for many modern robotics policies, and remains the workhorse of applied deep RL.
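To make the "one line of clipped arithmetic" concrete, here is a minimal sketch of PPO's clipped surrogate objective in plain Python. The function name `ppo_clip_objective`, its argument names, and the default `clip_eps=0.2` are illustrative assumptions, not code from the post; a real implementation would work on batched tensors and add value and entropy terms.

```python
import math

def ppo_clip_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective (to be maximised).

    logp_new / logp_old: log-probabilities of the taken actions under the
    current policy and under the policy that collected the data.
    advantages: advantage estimates for the same actions.
    All names here are hypothetical, chosen for illustration.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        # Probability ratio between the new and the old policy.
        ratio = math.exp(lp_new - lp_old)
        # The clipping step: keep the ratio inside [1 - eps, 1 + eps].
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Pessimistic choice: take the smaller of the two surrogate terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)

# A ratio of 2 with a positive advantage earns credit only up to 1 + eps.
value = ppo_clip_objective([math.log(2.0)], [0.0], [1.0])
```

Taking the minimum of the clipped and unclipped terms removes the incentive to push the ratio far from 1 in the direction that would increase the objective, which approximates the trust-region effect without a KL constraint or a second-order solver.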