Reinforcement Learning (10): Offline Reinforcement Learning

Mon, 15 Sep 2025 09:00:00 +0000

Every algorithm we’ve studied so far has the same core loop: act, observe, update. That loop is what makes RL work, but it is also what keeps RL out of many real-world deployments, where exploratory mistakes are costly or unsafe. A self-driving system can’t practice intersections by crashing. A clinical decision-support model can’t run a randomized policy on real patients. A factory robot can’t test ten thousand grasp variants on a production line.

These settings do have logs: millions of hours of human driving, decades of de-identified patient records, and terabytes of demonstration data collected for behavior cloning. Offline RL (also called batch RL) is the subfield that asks: can we extract a strong policy from a fixed dataset without any new interaction with the environment?

