Reinforcement Learning
Reinforcement Learning (12): RLHF and LLM Applications
How RLHF turned base language models into ChatGPT and Claude: the SFT→Reward-Model→PPO pipeline, the Bradley-Terry preference model, the DPO closed-form derivation, RLAIF and Constitutional AI, reward hacking and …
Reinforcement Learning (11): Hierarchical RL and Meta-Learning
A deep dive into hierarchical RL (Options, MAXQ, Feudal Networks, goal-conditioned policies) and meta-RL (MAML, FOMAML, RL^2). Covers temporal abstraction, semi-MDPs, manager-worker architectures, second-order …
Reinforcement Learning (10): Offline Reinforcement Learning
Master offline RL: learn policies from fixed datasets without environment interaction. Covers distributional shift, Conservative Q-Learning (CQL), BCQ, Implicit Q-Learning (IQL), Decision Transformer, with a complete CQL …
Reinforcement Learning (9): Multi-Agent Reinforcement Learning
A working tour of multi-agent RL: Markov games, the non-stationarity and credit-assignment problems, CTDE, value decomposition (VDN, QMIX), counterfactual baselines (COMA), MADDPG, communication topologies, and the …
Reinforcement Learning (8): AlphaGo and Monte Carlo Tree Search
From MCTS to AlphaGo, AlphaGo Zero, AlphaZero, and MuZero. Understand the UCT exploration-exploitation trade-off, self-play training, and planning with learned models. Includes a complete AlphaZero implementation for Gomoku.
Reinforcement Learning (7): Imitation Learning and Inverse RL
A practical, theory-grounded tour of imitation learning: behavioral cloning and its quadratic compounding error, DAgger and the no-regret reduction, MaxEnt inverse RL for recovering reward functions, and adversarial …
Reinforcement Learning (6): PPO and TRPO -- Trust Region Policy Optimization
Why PPO became the most widely used RL algorithm -- from TRPO's theoretical foundations through natural gradients to PPO's simple clipping mechanism, plus its role in RLHF for large language models.
Reinforcement Learning (5): Model-Based RL and World Models
From Dyna and MBPO to World Models, Dreamer, and MuZero -- how learning a model lets agents plan in imagination and reach expert performance with 10-100x fewer real interactions.
Reinforcement Learning (4): Exploration Strategies and Curiosity-Driven Learning
How do RL agents discover rewards when the environment gives almost no feedback? From count-based methods to ICM, RND, and NGU -- the science of curiosity-driven exploration.
Reinforcement Learning (3): Policy Gradient and Actor-Critic Methods
From REINFORCE to SAC -- how policy gradient methods directly optimize policies, naturally handle continuous actions, and power modern algorithms like PPO, TD3, and SAC.
Reinforcement Learning (2): Q-Learning and Deep Q-Networks (DQN)
How DQN combined neural networks with Q-Learning to master Atari games -- covering experience replay, target networks, Double DQN, Dueling DQN, Prioritized Experience Replay, and Rainbow.
Reinforcement Learning (1): Fundamentals and Core Concepts
A beginner-friendly guide to the mathematical foundations of reinforcement learning -- MDPs, Bellman equations, dynamic programming, Monte Carlo methods, and temporal difference learning -- with working Python code, all …