<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Reinforcement-Learning on Chen Kai Blog</title><link>https://www.chenk.top/en/series/reinforcement-learning/</link><description>Recent content in Reinforcement-Learning on Chen Kai Blog</description><generator>Hugo</generator><language>en</language><lastBuildDate>Thu, 25 Sep 2025 09:00:00 +0000</lastBuildDate><atom:link href="https://www.chenk.top/en/series/reinforcement-learning/index.xml" rel="self" type="application/rss+xml"/><item><title>Reinforcement Learning (12): RLHF and LLM Applications</title><link>https://www.chenk.top/en/reinforcement-learning/12-rlhf-and-llm-applications/</link><pubDate>Thu, 25 Sep 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/reinforcement-learning/12-rlhf-and-llm-applications/</guid><description>&lt;p>GPT-3 (June 2020) and ChatGPT (November 2022) are built on the same family of pretrained transformers. The base model could write fluent prose, complete code, and continue any pattern you gave it. Yet, when asked a simple question, it might ramble, refuse for the wrong reasons, hallucinate citations, or produce toxic content. The two and a half years between GPT-3 and ChatGPT weren&amp;rsquo;t spent on larger transformers. Instead, they went into &lt;strong>how to make the model useful&lt;/strong>, which is a reinforcement-learning problem.&lt;/p></description></item><item><title>Reinforcement Learning (11): Hierarchical RL and Meta-Learning</title><link>https://www.chenk.top/en/reinforcement-learning/11-hierarchical-and-meta-rl/</link><pubDate>Sat, 20 Sep 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/reinforcement-learning/11-hierarchical-and-meta-rl/</guid><description>&lt;p>Standard RL treats every problem as a flat sequence of atomic decisions: observe state, pick an action, receive a reward, repeat. 
That works when the horizon is short and rewards are dense, but it breaks down on the kind of tasks humans solve effortlessly. &amp;ldquo;Make breakfast&amp;rdquo; is not one decision; it is a tree of subtasks (&lt;em>brew coffee&lt;/em>, &lt;em>fry eggs&lt;/em>, &lt;em>toast bread&lt;/em>, &lt;em>plate it up&lt;/em>), each of which is itself a small policy. &lt;strong>Hierarchical RL (HRL)&lt;/strong> lets agents reason and act at multiple timescales by treating macro-actions as first-class citizens.&lt;/p></description></item><item><title>Reinforcement Learning (10): Offline Reinforcement Learning</title><link>https://www.chenk.top/en/reinforcement-learning/10-offline-reinforcement-learning/</link><pubDate>Mon, 15 Sep 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/reinforcement-learning/10-offline-reinforcement-learning/</guid><description>&lt;p>Every algorithm we&amp;rsquo;ve studied so far has the same core loop: act, observe, update. This loop is what makes RL work, but it is also what keeps RL out of settings where a bad action is expensive or dangerous. A self-driving system can&amp;rsquo;t practice intersections by crashing. A clinical decision-support model can&amp;rsquo;t run a randomized policy on real patients. A factory robot can&amp;rsquo;t test ten thousand grasp variants on a production line.&lt;/p>
&lt;p>These settings do have logs — millions of hours of human driving, decades of de-identified patient records, and terabytes of behavior cloning data. &lt;strong>Offline RL&lt;/strong> (also called &lt;em>batch RL&lt;/em>) is the subfield that asks: can we extract a strong policy from a fixed dataset without any new interaction with the environment?&lt;/p></description></item><item><title>Reinforcement Learning (9): Multi-Agent Reinforcement Learning</title><link>https://www.chenk.top/en/reinforcement-learning/09-multi-agent-rl/</link><pubDate>Wed, 10 Sep 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/reinforcement-learning/09-multi-agent-rl/</guid><description>&lt;p>Single-agent RL rests on one quiet but enormous assumption: the environment is stationary. The transition kernel does not change while the agent learns. The moment a second learner shares the world, that assumption collapses. Each agent now sees an environment whose dynamics shift as its peers update, rewards become entangled across agents, and the joint action space explodes combinatorially. These are not engineering nuisances. They are the reason multi-agent RL needs its own algorithms instead of just &lt;em>running DQN n times in parallel&lt;/em>.&lt;/p></description></item><item><title>Reinforcement Learning (8): AlphaGo and Monte Carlo Tree Search</title><link>https://www.chenk.top/en/reinforcement-learning/08-alphago-and-mcts/</link><pubDate>Fri, 05 Sep 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/reinforcement-learning/08-alphago-and-mcts/</guid><description>&lt;p>In March 2016, AlphaGo defeated world Go champion Lee Sedol 4–1 in Seoul. The result was not just a sporting upset; it was the moment a 60-year programme in artificial intelligence — beating the world&amp;rsquo;s best at Go — concluded a full decade ahead of most published predictions. Go has roughly &lt;span class="math-inline">$10^{170}$&lt;/span>
 legal positions, more than the number of atoms in the observable universe. No amount of brute-force search will ever crack it. AlphaGo&amp;rsquo;s victory came from a different idea: let a deep network supply the &lt;em>intuition&lt;/em> about which moves look promising, and let Monte Carlo Tree Search (MCTS) supply the &lt;em>deliberation&lt;/em> that verifies and sharpens that intuition.&lt;/p></description></item><item><title>Reinforcement Learning (7): Imitation Learning and Inverse RL</title><link>https://www.chenk.top/en/reinforcement-learning/07-imitation-learning/</link><pubDate>Sun, 31 Aug 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/reinforcement-learning/07-imitation-learning/</guid><description>&lt;p>Every algorithm in the previous chapters assumed access to a reward function. In practice, &lt;em>designing&lt;/em> that reward is often the hardest part of an RL project. Try writing one paragraph that captures &amp;ldquo;drive like a careful human&amp;rdquo;, &amp;ldquo;fold a shirt the way a tailor would&amp;rdquo;, or &amp;ldquo;summarise this document the way an expert editor would&amp;rdquo;. You can &lt;em>show&lt;/em> those behaviours far more easily than you can &lt;em>specify&lt;/em> them.&lt;/p>
&lt;p>Imitation learning takes that intuition seriously: instead of optimising a hand-engineered scalar, it learns from expert demonstrations &lt;span class="math-inline">$\mathcal{D} = \{(s_t, a_t)\}$&lt;/span>
. This chapter walks through the four canonical methods (behavioral cloning, DAgger, maximum-entropy IRL, and GAIL/AIRL), not as isolated tricks but as a single ladder where each rung relaxes one assumption and pays for it with new structure.&lt;/p></description></item><item><title>Reinforcement Learning (6): PPO and TRPO — Trust Region Policy Optimization</title><link>https://www.chenk.top/en/reinforcement-learning/06-ppo-and-trpo/</link><pubDate>Tue, 26 Aug 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/reinforcement-learning/06-ppo-and-trpo/</guid><description>&lt;p>Policy gradients (&lt;a href="https://www.chenk.top/en/reinforcement-learning/03-policy-gradient-and-actor-critic/">Part 3&lt;/a>
) optimise the policy directly, sidestepping discrete &lt;code>argmax&lt;/code> operators and naturally handling stochastic strategies. They have one fatal flaw: &lt;strong>a single overlong step can destroy the policy&lt;/strong>, and because the data distribution is &lt;em>coupled&lt;/em> to the policy, recovery is nearly impossible.&lt;/p>
&lt;p>&lt;strong>Trust-region methods&lt;/strong> turn that observation into a rule: bound the change in &lt;em>behaviour&lt;/em>, not in parameters, at every update. TRPO does this with a hard KL constraint and a second-order solver. PPO approximates the same effect with one line of clipped arithmetic. The simpler trick won: PPO trained OpenAI Five, powers ChatGPT&amp;rsquo;s RLHF stage, and underlies many modern robotics policies, and it remains the workhorse of applied deep RL.&lt;/p></description></item><item><title>Reinforcement Learning (5): Model-Based RL and World Models</title><link>https://www.chenk.top/en/reinforcement-learning/05-model-based-rl-and-world-models/</link><pubDate>Thu, 21 Aug 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/reinforcement-learning/05-model-based-rl-and-world-models/</guid><description>&lt;p>Every algorithm we have covered so far (DQN, REINFORCE, A2C, PPO, SAC) is &lt;strong>model-free&lt;/strong>: the agent treats the environment as a black box, throws actions at it, and updates its policy from the rewards that come back. The approach works, but it is profligate with samples. DQN needs roughly &lt;strong>10 million frames&lt;/strong> to master Atari Pong. OpenAI Five trained on Dota 2 for the equivalent of &lt;strong>~45,000 years&lt;/strong> of self-play. AlphaStar consumed years of StarCraft for a single agent.&lt;/p></description></item><item><title>Reinforcement Learning (4): Exploration Strategies and Curiosity-Driven Learning</title><link>https://www.chenk.top/en/reinforcement-learning/04-exploration-and-curiosity-driven-learning/</link><pubDate>Sat, 16 Aug 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/reinforcement-learning/04-exploration-and-curiosity-driven-learning/</guid><description>&lt;p>Drop a fresh agent into Montezuma&amp;rsquo;s Revenge. To score a single point, it must walk to the right, jump over a skull, climb a rope, leap to a platform, and grab a key: roughly &lt;strong>a hundred precise actions in a row&lt;/strong>. 
Until the key is collected, the reward signal is always zero.&lt;/p>
&lt;p>A textbook DQN with &lt;span class="math-inline">$\varepsilon=0.1$&lt;/span>
 exploration has, by a generous estimate, a &lt;span class="math-inline">$0.1^{100} = 10^{-100}$&lt;/span>
 chance of stumbling onto that key by accident. Unsurprisingly, vanilla DQN scores &lt;strong>0&lt;/strong> on this game. Not &amp;ldquo;low&amp;rdquo; — literally zero, every episode, for the entire training run.&lt;/p></description></item><item><title>Reinforcement Learning (3): Policy Gradient and Actor-Critic Methods</title><link>https://www.chenk.top/en/reinforcement-learning/03-policy-gradient-and-actor-critic/</link><pubDate>Mon, 11 Aug 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/reinforcement-learning/03-policy-gradient-and-actor-critic/</guid><description>&lt;p>DQN showed that deep RL can master Atari, but it has a hard ceiling: it only works in &lt;strong>discrete action spaces&lt;/strong>. Ask it to control a robot arm with seven continuous joint angles, and it fails — you&amp;rsquo;d have to solve an inner optimization problem every time you choose an action.&lt;/p>
&lt;p>&lt;strong>Policy gradient methods&lt;/strong> take a fundamentally different route. Instead of learning a value function and &lt;em>deriving&lt;/em> a policy from it, they &lt;strong>directly optimise the policy&lt;/strong>. That single change opens the door to continuous actions, stochastic strategies, and problems where the optimal play is itself random (think rock-paper-scissors).&lt;/p></description></item><item><title>Reinforcement Learning (2): Q-Learning and Deep Q-Networks (DQN)</title><link>https://www.chenk.top/en/reinforcement-learning/02-q-learning-and-dqn/</link><pubDate>Wed, 06 Aug 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/reinforcement-learning/02-q-learning-and-dqn/</guid><description>&lt;p>In December 2013, a small DeepMind team uploaded a paper to arXiv with a striking claim: a single neural network, trained from raw pixels and the score, learned to play seven Atari games — and beat the previous best on six of them. No game-specific features. No hand-coded heuristics. The same architecture for Pong, Breakout, and Space Invaders. The algorithm was &lt;strong>Deep Q-Network (DQN)&lt;/strong>, and it kicked off the deep reinforcement learning era.&lt;/p></description></item><item><title>Reinforcement Learning (1): Fundamentals and Core Concepts</title><link>https://www.chenk.top/en/reinforcement-learning/01-fundamentals-and-core-concepts/</link><pubDate>Fri, 01 Aug 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/reinforcement-learning/01-fundamentals-and-core-concepts/</guid><description>&lt;p>The first time you sat on a bicycle, nobody handed you a manual that said &lt;em>&amp;ldquo;if your tilt angle exceeds 7.4 degrees, apply 12% counter-steer.&amp;rdquo;&lt;/em> You wobbled, you over-corrected, you fell, you got back on. After a few hundred attempts your body simply &lt;em>knew&lt;/em> what to do, even though you could not put it into words.&lt;/p>
&lt;p>That trial-feedback-improvement loop is not just how we learn to ride bikes. It is how AlphaGo learned to defeat the world Go champion, how Boston Dynamics robots learn to walk, and how recommendation systems quietly improve every time you click. They all share one mathematical framework called &lt;strong>reinforcement learning&lt;/strong> (RL).&lt;/p></description></item></channel></rss>