<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Reinforcement Learning on Chen Kai Blog</title><link>https://www.chenk.top/en/reinforcement-learning/</link><description>Recent content in Reinforcement Learning on Chen Kai Blog</description><generator>Hugo</generator><language>en</language><lastBuildDate>Thu, 25 Sep 2025 09:00:00 +0000</lastBuildDate><atom:link href="https://www.chenk.top/en/reinforcement-learning/index.xml" rel="self" type="application/rss+xml"/><item><title>Reinforcement Learning (12): RLHF and LLM Applications</title><link>https://www.chenk.top/en/reinforcement-learning/12-rlhf-and-llm-applications/</link><pubDate>Thu, 25 Sep 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/reinforcement-learning/12-rlhf-and-llm-applications/</guid><description>&lt;p>GPT-3 (June 2020) and ChatGPT (November 2022) share most of their weights. The base model could write fluent prose, complete code, and continue any pattern you gave it — and yet, asked a plain question, it would happily ramble, refuse for the wrong reasons, hallucinate citations, or produce a paragraph of toxicity. The two and a half years between them were not spent on bigger transformers. They were spent learning &lt;strong>how to ask the model to be useful&lt;/strong> — and that turned out to be a reinforcement-learning problem.&lt;/p></description></item><item><title>Reinforcement Learning (11): Hierarchical RL and Meta-Learning</title><link>https://www.chenk.top/en/reinforcement-learning/11-hierarchical-and-meta-rl/</link><pubDate>Sat, 20 Sep 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/reinforcement-learning/11-hierarchical-and-meta-rl/</guid><description>&lt;p>Standard RL treats every problem as a flat sequence of atomic decisions: observe state, pick an action, receive a reward, repeat. That works when the horizon is short and rewards are dense, but it breaks down on the kind of tasks humans solve effortlessly. &amp;ldquo;Make breakfast&amp;rdquo; is not one decision; it is a tree of subtasks &amp;mdash; &lt;em>brew coffee&lt;/em>, &lt;em>fry eggs&lt;/em>, &lt;em>toast bread&lt;/em>, &lt;em>plate it up&lt;/em> &amp;mdash; each of which is itself a small policy. &lt;strong>Hierarchical RL (HRL)&lt;/strong> lets agents reason and act at multiple timescales by treating macro-actions as first-class citizens.&lt;/p></description></item><item><title>Reinforcement Learning (10): Offline Reinforcement Learning</title><link>https://www.chenk.top/en/reinforcement-learning/10-offline-reinforcement-learning/</link><pubDate>Mon, 15 Sep 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/reinforcement-learning/10-offline-reinforcement-learning/</guid><description>&lt;p>Every algorithm we have studied so far has the same loop at its core: act, observe, update. That loop is what makes RL work, but it is also what stops RL from being deployed. A self-driving stack cannot rehearse intersections by crashing into them. A clinical decision-support model cannot run a randomized policy on actual patients. A factory robot cannot try ten thousand grasp variants on a production line.&lt;/p>
&lt;p>What these settings &lt;em>do&lt;/em> have is logs &amp;ndash; millions of hours of human driving, decades of de-identified patient records, terabytes of behavior cloning data. &lt;strong>Offline RL&lt;/strong> (also called &lt;em>batch RL&lt;/em>) is the subfield that asks: can we squeeze a strong policy out of a fixed dataset, with &lt;strong>zero new interaction&lt;/strong> with the environment?&lt;/p></description></item><item><title>Reinforcement Learning (9): Multi-Agent Reinforcement Learning</title><link>https://www.chenk.top/en/reinforcement-learning/09-multi-agent-rl/</link><pubDate>Wed, 10 Sep 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/reinforcement-learning/09-multi-agent-rl/</guid><description>&lt;p>Single-agent RL rests on one quiet but enormous assumption: the environment is stationary. The transition kernel does not change while the agent learns. The moment a second learner shares the world, that assumption collapses. Each agent now sees an environment whose dynamics shift as its peers update, rewards become entangled across agents, and the joint action space explodes combinatorially. These are not engineering nuisances. They are the reason multi-agent RL needs its own algorithms instead of just &lt;em>running DQN n times in parallel&lt;/em>.&lt;/p></description></item><item><title>Reinforcement Learning (8): AlphaGo and Monte Carlo Tree Search</title><link>https://www.chenk.top/en/reinforcement-learning/08-alphago-and-mcts/</link><pubDate>Fri, 05 Sep 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/reinforcement-learning/08-alphago-and-mcts/</guid><description>&lt;p>In March 2016, AlphaGo defeated world Go champion Lee Sedol 4–1 in Seoul. The result was not just a sporting upset; it was the moment a 60-year programme in artificial intelligence — beating the world&amp;rsquo;s best at Go — concluded a full decade ahead of most published predictions. Go has roughly $10^{170}$ legal positions, more than the number of atoms in the observable universe. No amount of brute-force search will ever crack it. AlphaGo&amp;rsquo;s victory came from a different idea: let a deep network supply the &lt;em>intuition&lt;/em> about which moves look promising, and let Monte Carlo Tree Search (MCTS) supply the &lt;em>deliberation&lt;/em> that verifies and sharpens that intuition.&lt;/p></description></item><item><title>Reinforcement Learning (7): Imitation Learning and Inverse RL</title><link>https://www.chenk.top/en/reinforcement-learning/07-imitation-learning/</link><pubDate>Sun, 31 Aug 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/reinforcement-learning/07-imitation-learning/</guid><description>&lt;p>Every algorithm in the previous chapters assumed access to a reward function. In practice, &lt;em>designing&lt;/em> that reward is often the hardest part of an RL project. Try writing one paragraph that captures &amp;ldquo;drive like a careful human&amp;rdquo;, &amp;ldquo;fold a shirt the way a tailor would&amp;rdquo;, or &amp;ldquo;summarise this document the way an expert editor would&amp;rdquo;. You can &lt;em>show&lt;/em> those behaviours far more easily than you can &lt;em>specify&lt;/em> them.&lt;/p>
&lt;p>Imitation learning takes that intuition seriously: instead of optimising a hand-engineered scalar, it learns from expert demonstrations $\mathcal{D} = \{(s_t, a_t)\}$. This chapter walks the four canonical methods &amp;ndash; behavioral cloning, DAgger, maximum-entropy IRL, and GAIL/AIRL &amp;ndash; not as isolated tricks but as a single ladder where each rung relaxes one assumption and pays for it with new structure.&lt;/p></description></item><item><title>Reinforcement Learning (6): PPO and TRPO -- Trust Region Policy Optimization</title><link>https://www.chenk.top/en/reinforcement-learning/06-ppo-and-trpo/</link><pubDate>Tue, 26 Aug 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/reinforcement-learning/06-ppo-and-trpo/</guid><description>&lt;p>Policy gradients (Part 3) optimise the policy directly, sidestepping discrete &lt;code>argmax&lt;/code> operators and naturally handling stochastic strategies. They have one fatal flaw: &lt;strong>a single overlong step can destroy the policy&lt;/strong>, and because the data distribution is &lt;em>coupled&lt;/em> to the policy, recovery is nearly impossible.&lt;/p>
&lt;p>&lt;strong>Trust-region methods&lt;/strong> make this concrete: bound the change in &lt;em>behaviour&lt;/em>, not in parameters, at every update. TRPO does it through a hard KL constraint and a second-order solver. PPO mimics the same effect with one line of clipped arithmetic. The cheaper trick won: PPO trains OpenAI Five, ChatGPT&amp;rsquo;s RLHF stage, almost every modern robotics policy, and remains the workhorse of applied deep RL.&lt;/p></description></item><item><title>Reinforcement Learning (5): Model-Based RL and World Models</title><link>https://www.chenk.top/en/reinforcement-learning/05-model-based-rl-and-world-models/</link><pubDate>Thu, 21 Aug 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/reinforcement-learning/05-model-based-rl-and-world-models/</guid><description>&lt;p>Every algorithm we have covered so far &amp;ndash; DQN, REINFORCE, A2C, PPO, SAC &amp;ndash; is &lt;strong>model-free&lt;/strong>: the agent treats the environment as a black box, throws actions at it, and updates its policy from the rewards that come back. The approach works, but it is profligate. DQN needs roughly &lt;strong>10 million frames&lt;/strong> to master Atari Pong. OpenAI Five trained on Dota 2 for the equivalent of &lt;strong>~45,000 years&lt;/strong> of self-play. AlphaStar consumed years of StarCraft for a single agent.&lt;/p></description></item><item><title>Reinforcement Learning (4): Exploration Strategies and Curiosity-Driven Learning</title><link>https://www.chenk.top/en/reinforcement-learning/04-exploration-and-curiosity-driven-learning/</link><pubDate>Sat, 16 Aug 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/reinforcement-learning/04-exploration-and-curiosity-driven-learning/</guid><description>&lt;p>Drop a fresh agent into Montezuma&amp;rsquo;s Revenge. To score a single point it must walk to the right, jump a skull, climb a rope, leap to a platform, and grab a key &amp;ndash; roughly &lt;strong>a hundred precise actions in a row&lt;/strong>. Until that key is collected, every reward signal is exactly zero.&lt;/p>
&lt;p>A textbook DQN with $\varepsilon=0.1$ exploration has, by a generous estimate, a $0.1^{100} \approx 10^{-100}$ chance of stumbling onto that key by accident. Unsurprisingly, vanilla DQN scores &lt;strong>0&lt;/strong> on this game. Not &amp;ldquo;low&amp;rdquo; &amp;ndash; literally zero, every episode, for the entire training run.&lt;/p></description></item><item><title>Reinforcement Learning (3): Policy Gradient and Actor-Critic Methods</title><link>https://www.chenk.top/en/reinforcement-learning/03-policy-gradient-and-actor-critic/</link><pubDate>Mon, 11 Aug 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/reinforcement-learning/03-policy-gradient-and-actor-critic/</guid><description>&lt;p>DQN proved that deep RL can master Atari, but it has a hard ceiling: it only works in &lt;strong>discrete action spaces&lt;/strong>. Ask it to control a robot arm with seven continuous joint angles and it falls apart &amp;ndash; you would have to solve an inner optimisation problem every time you choose an action.&lt;/p>
&lt;p>&lt;strong>Policy gradient methods&lt;/strong> take a fundamentally different route. Instead of learning a value function and &lt;em>deriving&lt;/em> a policy from it, they &lt;strong>directly optimise the policy&lt;/strong>. That single change opens the door to continuous actions, stochastic strategies, and problems where the optimal play is itself random (think rock-paper-scissors).&lt;/p></description></item><item><title>Reinforcement Learning (2): Q-Learning and Deep Q-Networks (DQN)</title><link>https://www.chenk.top/en/reinforcement-learning/02-q-learning-and-dqn/</link><pubDate>Wed, 06 Aug 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/reinforcement-learning/02-q-learning-and-dqn/</guid><description>&lt;p>In December 2013, a small DeepMind team uploaded a paper to arXiv with a striking claim: a single neural network, trained from raw pixels and the score, learned to play seven Atari games &amp;ndash; and beat the previous best on six of them. No game-specific features. No hand-coded heuristics. The same architecture for Pong, Breakout, and Space Invaders. The algorithm was &lt;strong>Deep Q-Network (DQN)&lt;/strong>, and it kicked off the deep reinforcement learning era.&lt;/p></description></item><item><title>Reinforcement Learning (1): Fundamentals and Core Concepts</title><link>https://www.chenk.top/en/reinforcement-learning/01-fundamentals-and-core-concepts/</link><pubDate>Fri, 01 Aug 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/reinforcement-learning/01-fundamentals-and-core-concepts/</guid><description>&lt;p>The first time you sat on a bicycle, nobody handed you a manual that said &lt;em>&amp;ldquo;if your tilt angle exceeds 7.4 degrees, apply 12% counter-steer.&amp;rdquo;&lt;/em> You wobbled, you over-corrected, you fell, you got back on. After a few hundred attempts your body simply &lt;em>knew&lt;/em> what to do, even though you could not put it into words.&lt;/p>
&lt;p>That trial-feedback-improvement loop is not just how we learn to ride bikes. It is how AlphaGo learned to defeat the world Go champion, how Boston Dynamics robots learn to walk, and how recommendation systems quietly improve every time you click. They all share one mathematical framework called &lt;strong>reinforcement learning&lt;/strong> (RL).&lt;/p></description></item></channel></rss>