Reinforcement Learning (6): PPO and TRPO — Trust Region Policy Optimization

Tue, 26 Aug 2025 09:00:00 +0000

Policy gradients (Part 3) optimise the policy directly, sidestepping discrete argmax operators and naturally handling stochastic strategies. They have one fatal flaw: a single overlong step can destroy the policy, and because the policy collects its own training data, a damaged policy gathers poor data and recovery is nearly impossible.

Trust-region methods tackle this flaw directly: at every update they bound the change in the policy's behaviour, not the change in its parameters. TRPO does this with a hard KL constraint and a second-order solver. PPO approximates the same effect with a single clipped term in its objective. The simpler trick won: PPO trained OpenAI Five, drives ChatGPT’s RLHF stage, and is used for many modern robotics policies, and it remains the workhorse of applied deep RL.
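To make the "clipped term" concrete, here is a minimal NumPy sketch of the standard PPO clipped surrogate, with a crude sample-based KL estimate as a stand-in for the quantity TRPO constrains. The function names, the argument names, and the default `clip_eps=0.2` are illustrative assumptions, not taken from this post or from any particular library.

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective (to be maximised), averaged over samples.

    logp_new / logp_old: log-probabilities of the taken actions under the
    updated policy and the data-collecting policy (hypothetical inputs).
    """
    # Probability ratio pi_new(a|s) / pi_old(a|s), computed in log space.
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    # Ratios outside [1 - eps, 1 + eps] earn no extra credit.
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Taking the minimum makes the objective a pessimistic bound.
    return np.mean(np.minimum(unclipped, clipped))

def approx_kl(logp_new, logp_old):
    """Simple sample estimate of KL(pi_old || pi_new) from taken actions."""
    return np.mean(logp_old - logp_new)

# One sample: the action's probability rises from 0.5 to 0.75 (ratio 1.5).
logp_old = np.log(np.array([0.5]))
logp_new = np.log(np.array([0.75]))

# Positive advantage: the gain is capped at (1 + eps) * A.
gain_capped = ppo_clip_objective(logp_new, logp_old, np.array([1.0]))
# Negative advantage: the penalty is not capped, so bad moves still hurt.
penalty_full = ppo_clip_objective(logp_new, logp_old, np.array([-1.0]))
```

In this sketch `gain_capped` evaluates to 1.2 rather than 1.5, so once the ratio leaves the clip range there is no further incentive to push the policy in that direction. That is how the clipping loosely imitates a trust region without TRPO's explicit KL constraint.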

Trust Region on Chen Kai Blog