Tagged: RLHF
Reinforcement Learning (12): RLHF and LLM Applications
How RLHF turned base language models into ChatGPT and Claude: the SFT→Reward-Model→PPO pipeline, the Bradley-Terry preference model, the DPO closed-form derivation, RLAIF and Constitutional AI, reward hacking and …
Reinforcement Learning (6): PPO and TRPO -- Trust Region Policy Optimization
Why PPO became the most widely used RL algorithm: from TRPO's theoretical foundations, through natural gradients, to PPO's clipping mechanism, plus its role in RLHF for large language models.