
Reinforcement Learning
Foundations of RL: MDPs, policy gradients, actor-critic, and offline RL.
01 Reinforcement Learning (1): Fundamentals and Core Concepts
A beginner-friendly guide to the mathematical foundations of reinforcement learning -- MDPs, Bellman equations, dynamic …
02 Reinforcement Learning (2): Q-Learning and Deep Q-Networks (DQN)
How DQN combined neural networks with Q-Learning to master Atari games -- covering experience replay, target networks, …
03 Reinforcement Learning (3): Policy Gradient and Actor-Critic Methods
From REINFORCE to SAC -- how policy gradient methods directly optimize policies, naturally handle continuous actions, …
04 Reinforcement Learning (4): Exploration Strategies and Curiosity-Driven Learning
How do RL agents discover rewards when the environment gives almost no feedback? From count-based methods to ICM, RND, …
05 Reinforcement Learning (5): Model-Based RL and World Models
From Dyna and MBPO to World Models, Dreamer, and MuZero -- how learning a model lets agents plan in imagination and …
06 Reinforcement Learning (6): PPO and TRPO -- Trust Region Policy Optimization
Why PPO became the most widely used RL algorithm -- from TRPO's theoretical foundations through natural gradients to …
07 Reinforcement Learning (7): Imitation Learning and Inverse RL
A practical, theory-grounded tour of imitation learning: behavioral cloning and its quadratic compounding error, DAgger …
08 Reinforcement Learning (8): AlphaGo and Monte Carlo Tree Search
From MCTS to AlphaGo, AlphaGo Zero, AlphaZero, and MuZero. Understand UCT exploration-exploitation, self-play training, …
09 Reinforcement Learning (9): Multi-Agent Reinforcement Learning
A working tour of multi-agent RL: Markov games, the non-stationarity and credit-assignment problems, CTDE, value …
10 Reinforcement Learning (10): Offline Reinforcement Learning
Master offline RL: learn policies from fixed datasets without environment interaction. Covers distributional shift, …
11 Reinforcement Learning (11): Hierarchical RL and Meta-Learning
A deep dive into hierarchical RL (Options, MAXQ, Feudal Networks, goal-conditioned policies) and meta-RL (MAML, FOMAML, …
12 Reinforcement Learning (12): RLHF and LLM Applications
How RLHF turned base language models into ChatGPT and Claude: the SFT→Reward-Model→PPO pipeline, the Bradley-Terry …