Tagged: DPO

Sep 25, 2025 · Reinforcement Learning · 3 min read

Reinforcement Learning (12): RLHF and LLM Applications

How RLHF turned base language models into ChatGPT and Claude: the SFT→Reward-Model→PPO pipeline, the Bradley-Terry preference model, the DPO closed-form derivation, RLAIF and Constitutional AI, reward hacking and …