PPO

Aug 26, 2025 Reinforcement Learning 32 min read

Reinforcement Learning (6): PPO and TRPO — Trust Region Policy Optimization

Why PPO became one of the most widely used RL algorithms -- from TRPO's theoretical foundations, through natural gradients, to PPO's clipping mechanism, plus its role in RLHF for large language models.

Aug 11, 2025 Reinforcement Learning 28 min read

Reinforcement Learning (3): Policy Gradient and Actor-Critic Methods

From REINFORCE to SAC -- how policy gradient methods directly optimize policies, naturally handle continuous actions, and power modern algorithms like PPO, TD3, and SAC.