Reinforcement-Learning on Chen Kai Blog

Reinforcement Learning (12): RLHF and LLM Applications

Thu, 25 Sep 2025 09:00:00 +0000

GPT-3 (June 2020) and ChatGPT (November 2022) are built on the same basic architecture. The base model could write fluent prose, complete code, and continue any pattern you gave it. Yet, when asked a simple question, it might ramble, refuse for the wrong reasons, hallucinate citations, or produce toxic content. The two and a half years between GPT-3 and ChatGPT weren’t spent on larger transformers. They were spent on making the model useful: a reinforcement-learning problem.

Reinforcement Learning (11): Hierarchical RL and Meta-Learning

Sat, 20 Sep 2025 09:00:00 +0000

Standard RL treats every problem as a flat sequence of atomic decisions: observe state, pick an action, receive a reward, repeat. That works when the horizon is short and rewards are dense, but it breaks down on the kind of tasks humans solve effortlessly. “Make breakfast” is not one decision; it is a tree of subtasks — brew coffee, fry eggs, toast bread, plate it up — each of which is itself a small policy. Hierarchical RL (HRL) lets agents reason and act at multiple timescales by treating macro-actions as first-class citizens.

Reinforcement Learning (10): Offline Reinforcement Learning

Mon, 15 Sep 2025 09:00:00 +0000

Every algorithm we’ve studied so far has the same core loop: act, observe, update. This loop makes RL work, but it is also what keeps RL out of many real-world deployments. A self-driving system can’t practice intersections by crashing. A clinical decision-support model can’t run a randomized policy on real patients. A factory robot can’t test ten thousand grasp variants on a production line.

These settings do have logs — millions of hours of human driving, decades of de-identified patient records, and terabytes of behavior cloning data. Offline RL (also called batch RL) is the subfield that asks: can we extract a strong policy from a fixed dataset without any new interaction with the environment?

Reinforcement Learning (9): Multi-Agent Reinforcement Learning

Wed, 10 Sep 2025 09:00:00 +0000

Single-agent RL rests on one quiet but enormous assumption: the environment is stationary. The transition kernel does not change while the agent learns. The moment a second learner shares the world, that assumption collapses. Each agent now sees an environment whose dynamics shift as its peers update, rewards become entangled across agents, and the joint action space explodes combinatorially. These are not engineering nuisances. They are the reason multi-agent RL needs its own algorithms instead of just running DQN n times in parallel.

Reinforcement Learning (8): AlphaGo and Monte Carlo Tree Search

Fri, 05 Sep 2025 09:00:00 +0000

In March 2016, AlphaGo defeated world Go champion Lee Sedol 4–1 in Seoul. The result was not just a sporting upset; it was the moment a 60-year programme in artificial intelligence — beating the world’s best at Go — concluded a full decade ahead of most published predictions. Go has roughly $10^{170}$ legal positions, more than the number of atoms in the observable universe. No amount of brute-force search will ever crack it. AlphaGo’s victory came from a different idea: let a deep network supply the intuition about which moves look promising, and let Monte Carlo Tree Search (MCTS) supply the deliberation that verifies and sharpens that intuition.

Reinforcement Learning (7): Imitation Learning and Inverse RL

Sun, 31 Aug 2025 09:00:00 +0000

Every algorithm in the previous chapters assumed access to a reward function. In practice, designing that reward is often the hardest part of an RL project. Try writing one paragraph that captures “drive like a careful human”, “fold a shirt the way a tailor would”, or “summarise this document the way an expert editor would”. You can show those behaviours far more easily than you can specify them.

Imitation learning takes that intuition seriously: instead of optimising a hand-engineered scalar, it learns from expert demonstrations $\mathcal{D} = \{(s_t, a_t)\}$. This chapter walks through the four canonical methods (behavioral cloning, DAgger, maximum-entropy IRL, and GAIL/AIRL), not as isolated tricks but as a single ladder where each rung relaxes one assumption and pays for it with new structure.

Reinforcement Learning (6): PPO and TRPO — Trust Region Policy Optimization

Tue, 26 Aug 2025 09:00:00 +0000

Policy gradients (Part 3) optimise the policy directly, sidestepping discrete argmax operators and naturally handling stochastic strategies. They have one fatal flaw: a single overlong step can destroy the policy, and because the data distribution is coupled to the policy, recovery is nearly impossible.

Trust-region methods address this flaw directly: bound the change in behaviour, not in parameters, at every update. TRPO does this with a hard KL constraint and a second-order solver. PPO approximates the same effect with one line of clipped arithmetic. The simpler trick won: PPO trained OpenAI Five, drives ChatGPT’s RLHF stage, and underpins many modern robotics policies, and it remains the workhorse of applied deep RL.
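
For readers who want to see the clipping idea in code, here is a minimal sketch of the clipped surrogate objective in plain Python. It is an illustration under stated assumptions, not the chapter's implementation: the function name `ppo_clip_objective` is hypothetical, the inputs are assumed to be precomputed probability ratios (new policy over old policy) and advantage estimates, and `clip_eps=0.2` is only a commonly used default.

```python
def ppo_clip_objective(ratios, advantages, clip_eps=0.2):
    """Average clipped surrogate objective (a quantity to maximise).

    ratios:     pi_new(a|s) / pi_old(a|s) for each sample (assumed precomputed)
    advantages: advantage estimate for each sample (assumed precomputed)
    clip_eps:   clip range; 0.2 is an assumed default, not taken from the text
    """
    total = 0.0
    for r, a in zip(ratios, advantages):
        # Restrict the ratio to [1 - eps, 1 + eps].
        clipped_r = max(1.0 - clip_eps, min(r, 1.0 + clip_eps))
        # Take the more pessimistic of the unclipped and clipped terms, so the
        # objective gains nothing from pushing the ratio far outside the range.
        total += min(r * a, clipped_r * a)
    return total / len(ratios)


# A large ratio with a positive advantage is capped at (1 + eps) * advantage.
capped = ppo_clip_objective([1.5], [1.0])
# A small ratio with a negative advantage is held at (1 - eps) * advantage.
held = ppo_clip_objective([0.5], [-1.0])
print(capped, held)
```

Taking the minimum of the two terms is what stands in for the trust region here: once the ratio leaves the clip range in the direction that would increase the objective, the term becomes constant and contributes no further gradient.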

Reinforcement Learning (5): Model-Based RL and World Models

Thu, 21 Aug 2025 09:00:00 +0000

Every algorithm we have covered so far (DQN, REINFORCE, A2C, PPO, SAC) is model-free: the agent treats the environment as a black box, throws actions at it, and updates its policy from the rewards that come back. The approach works, but it is profligate. DQN needs roughly 10 million frames to master Atari Pong. OpenAI Five trained on Dota 2 for the equivalent of ~45,000 years of self-play. AlphaStar consumed up to 200 years of real-time StarCraft play for a single agent.

Reinforcement Learning (4): Exploration Strategies and Curiosity-Driven Learning

Sat, 16 Aug 2025 09:00:00 +0000

Drop a fresh agent into Montezuma’s Revenge. To score a single point, it must walk to the right, jump over a skull, climb a rope, leap to a platform, and grab a key — roughly a hundred precise actions in a row. Until the key is collected, the reward signal is always zero.

A textbook DQN with $\varepsilon=0.1$ exploration has, by a generous estimate, a $0.1^{100} = 10^{-100}$ chance of stumbling onto that key by accident. Unsurprisingly, vanilla DQN scores 0 on this game. Not “low”: literally zero, every episode, for the entire training run.
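
The back-of-the-envelope estimate above is easy to reproduce. The sketch below works in log space so the arithmetic stays readable. It keeps the text's simplifying assumption that each of the roughly 100 required actions is hit independently with probability $\varepsilon$; the helper name `log10_chance` is purely illustrative.

```python
import math


def log10_chance(epsilon: float, steps: int) -> float:
    """Return log10 of epsilon**steps, computed in log space."""
    return steps * math.log10(epsilon)


# Simplified model from the text: 100 required actions, each hit with
# probability epsilon = 0.1, all independent of one another.
log10_p = log10_chance(epsilon=0.1, steps=100)
print(f"log10 P(key by chance) = {log10_p:.1f}")

# With per-episode success probability p, the expected number of episodes
# before the first success is 1 / p.
print(f"expected episodes until first success ~ 10^{-log10_p:.0f}")
```

Changing `steps` or `epsilon` shows how quickly the odds collapse: under this model, every extra required action multiplies the chance by another factor of $\varepsilon$.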

Reinforcement Learning (3): Policy Gradient and Actor-Critic Methods

Mon, 11 Aug 2025 09:00:00 +0000

DQN showed that deep RL can master Atari, but it has a hard ceiling: it only works in discrete action spaces. Ask it to control a robot arm with seven continuous joint angles, and it fails — you’d have to solve an inner optimization problem every time you choose an action.

Policy gradient methods take a fundamentally different route. Instead of learning a value function and deriving a policy from it, they directly optimise the policy. That single change opens the door to continuous actions, stochastic strategies, and problems where the optimal play is itself random (think rock-paper-scissors).

Reinforcement Learning (2): Q-Learning and Deep Q-Networks (DQN)

Wed, 06 Aug 2025 09:00:00 +0000

In December 2013, a small DeepMind team uploaded a paper to arXiv with a striking claim: a single neural network, trained from raw pixels and the score, learned to play seven Atari games — and beat the previous best on six of them. No game-specific features. No hand-coded heuristics. The same architecture for Pong, Breakout, and Space Invaders. The algorithm was Deep Q-Network (DQN), and it kicked off the deep reinforcement learning era.

Reinforcement Learning (1): Fundamentals and Core Concepts

Fri, 01 Aug 2025 09:00:00 +0000

The first time you sat on a bicycle, nobody handed you a manual that said “if your tilt angle exceeds 7.4 degrees, apply 12% counter-steer.” You wobbled, you over-corrected, you fell, you got back on. After a few hundred attempts your body simply knew what to do, even though you could not put it into words.

That trial-feedback-improvement loop is not just how we learn to ride bikes. It is how AlphaGo learned to defeat the world Go champion, how Boston Dynamics robots learn to walk, and how recommendation systems quietly improve every time you click. They all share one mathematical framework called reinforcement learning (RL).