Reinforcement Learning (5): Model-Based RL and World Models

Thu, 21 Aug 2025 09:00:00 +0000

Every algorithm we have covered so far (DQN, REINFORCE, A2C, PPO, SAC) is model-free: the agent treats the environment as a black box, throws actions at it, and updates its policy from the rewards that come back. The approach works, but it is profligate. DQN needs roughly 10 million frames to master Atari Pong. OpenAI Five trained on Dota 2 for the equivalent of ~45,000 years of self-play. Each AlphaStar agent experienced up to 200 years of real-time StarCraft II play.

