Tagged: PPO
Reinforcement Learning (6): PPO and TRPO -- Trust Region Policy Optimization
Why PPO became one of the most widely used RL algorithms -- from TRPO's theoretical foundations through natural gradients to PPO's elegant clipping mechanism, plus its role in RLHF for large language models.
Reinforcement Learning (3): Policy Gradient and Actor-Critic Methods
From REINFORCE to SAC -- how policy gradient methods directly optimize policies, naturally handle continuous actions, and power modern algorithms like PPO, TD3, and SAC.