<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>CQL on Chen Kai Blog</title><link>https://www.chenk.top/en/tags/cql/</link><description>Recent content in CQL on Chen Kai Blog</description><generator>Hugo</generator><language>en</language><lastBuildDate>Mon, 15 Sep 2025 09:00:00 +0000</lastBuildDate><atom:link href="https://www.chenk.top/en/tags/cql/index.xml" rel="self" type="application/rss+xml"/><item><title>Reinforcement Learning (10): Offline Reinforcement Learning</title><link>https://www.chenk.top/en/reinforcement-learning/10-offline-reinforcement-learning/</link><pubDate>Mon, 15 Sep 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/reinforcement-learning/10-offline-reinforcement-learning/</guid><description>&lt;p>Every algorithm we&amp;rsquo;ve studied so far has the same core loop: act, observe, update. This loop makes RL work, but it also prevents RL from being deployed. A self-driving system can&amp;rsquo;t practice intersections by crashing. A clinical decision-support model can&amp;rsquo;t run a randomized policy on real patients. A factory robot can&amp;rsquo;t test ten thousand grasp variants on a production line.&lt;/p>
&lt;p>These settings do have logs: millions of hours of human driving, decades of de-identified patient records, and terabytes of behavior-cloning data. &lt;strong>Offline RL&lt;/strong> (also called &lt;em>batch RL&lt;/em>) is the subfield that asks: can we extract a strong policy from a fixed dataset without any new interaction with the environment?&lt;/p></description></item></channel></rss>