Reinforcement Learning
Reinforcement Learning (12): RLHF and LLM Applications
How RLHF turned base language models into ChatGPT and Claude: the SFT→Reward-Model→PPO pipeline, the Bradley-Terry preference model, the DPO closed-form derivation, RLAIF and Constitutional AI, reward hacking and …
Reinforcement Learning (11): Hierarchical RL and Meta-Learning
A deep dive into hierarchical RL (Options, MAXQ, Feudal Networks, goal-conditioned policies) and meta-RL (MAML, FOMAML, RL^2). Covers temporal abstraction, semi-MDPs, manager-worker architectures, second-order …
Reinforcement Learning (10): Offline Reinforcement Learning
Master offline RL: learn policies from fixed datasets without environment interaction. Covers distributional shift, Conservative Q-Learning (CQL), BCQ, Implicit Q-Learning (IQL), Decision Transformer, with a complete CQL …
Reinforcement Learning (9): Multi-Agent Reinforcement Learning
A working tour of multi-agent RL: Markov games, the non-stationarity and credit-assignment problems, CTDE, value decomposition (VDN, QMIX), counterfactual baselines (COMA), MADDPG, communication topologies, and the …
Reinforcement Learning (8): AlphaGo and Monte Carlo Tree Search
From MCTS to AlphaGo, AlphaGo Zero, AlphaZero, and MuZero. Understand the UCT exploration-exploitation trade-off, self-play training, and planning with learned models. Includes a complete AlphaZero implementation for Gomoku.
Reinforcement Learning (7): Imitation Learning and Inverse RL
A practical, theory-grounded tour of imitation learning: behavioral cloning and its quadratic compounding error, DAgger and the no-regret reduction, MaxEnt inverse RL for recovering reward functions, and adversarial …
Reinforcement Learning (6): PPO and TRPO -- Trust Region Policy Optimization
Why PPO became the most widely used RL algorithm -- from TRPO's theoretical foundations through natural gradients to PPO's simple clipping mechanism, plus its role in RLHF for large language models.
Reinforcement Learning (5): Model-Based RL and World Models
From Dyna and MBPO to World Models, Dreamer, and MuZero -- how learning a model lets agents plan in imagination and reach expert performance with 10-100x fewer real interactions.
Reinforcement Learning (4): Exploration Strategies and Curiosity-Driven Learning
How do RL agents discover rewards when the environment gives almost no feedback? From count-based methods to ICM, RND, and NGU -- the science of curiosity-driven exploration.
Reinforcement Learning (3): Policy Gradient and Actor-Critic Methods
From REINFORCE to SAC -- how policy gradient methods directly optimize policies, naturally handle continuous actions, and power modern algorithms like PPO, TD3, and SAC.
Reinforcement Learning (2): Q-Learning and Deep Q-Networks (DQN)
How DQN combined neural networks with Q-Learning to master Atari games -- covering experience replay, target networks, Double DQN, Dueling DQN, Prioritized Experience Replay, and Rainbow.
Reinforcement Learning (1): Fundamentals and Core Concepts
A beginner-friendly guide to the mathematical foundations of reinforcement learning -- MDPs, Bellman equations, dynamic programming, Monte Carlo methods, and temporal difference learning -- with working Python code, all …