Distilling Reinforcement Learning Algorithms for In-Context Model-Based Planning

Jaehyeon Son, Soochan Lee, Gunhee Kim

arXiv.org Artificial Intelligence 

Recent studies have shown that Transformers can perform in-context reinforcement learning (RL) by imitating existing RL algorithms, enabling sample-efficient adaptation to unseen tasks without parameter updates. However, these models also inherit the suboptimal behaviors of the RL algorithms they imitate, an issue that primarily arises from the gradual update rule employed by those algorithms. Model-based planning offers a promising solution to this limitation by allowing the models to simulate potential outcomes before taking action, providing an additional mechanism to deviate from suboptimal behavior. Rather than learning a separate dynamics model, we propose Distillation for In-Context Planning (DICP), an in-context model-based RL framework in which Transformers simultaneously learn environment dynamics and improve the policy in-context. We evaluate DICP across a range of discrete and continuous environments, including Darkroom variants and Meta-World. Our results show that DICP achieves state-of-the-art performance while requiring significantly fewer environment interactions than baselines, which include both model-free counterparts and existing meta-RL methods.

Since the introduction of Transformers (Vaswani et al., 2017), their versatility in handling diverse tasks has been widely recognized across various domains (Brown et al., 2020; Dosovitskiy et al., 2021; Bubeck et al., 2023). A key aspect of their success is in-context learning (Brown et al., 2020), which enables models to acquire knowledge rapidly without explicit parameter updates through gradient descent. Recently, this capability has been explored in reinforcement learning (RL) (Chen et al., 2021; Schulman et al., 2017; Lee et al., 2022; Reed et al., 2022), where acquiring skills in a sample-efficient manner is crucial. This line of research naturally extends to meta-RL, which focuses on leveraging prior knowledge to quickly adapt to novel tasks. In this context, Laskin et al. (2023) introduce Algorithm Distillation (AD), an in-context RL approach where Transformers sequentially model the entire learning histories of a specific RL algorithm across various tasks. The goal is for the models to replicate the exploration-exploitation behavior of the source RL algorithm, enabling them to tackle novel tasks purely in-context.
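To make the idea of in-context model-based planning concrete, the sketch below shows how a single action could be chosen by a model that exposes both a distilled policy head and a learned dynamics head conditioned on the interaction history. This is a minimal illustration under stated assumptions, not the paper's exact procedure: the class `InContextModel`, the functions `plan_action`, `action_logits`, and `predict_dynamics`, and the one-step-greedy lookahead are all hypothetical names and simplifications standing in for a Transformer queried in-context.

```python
import numpy as np


class InContextModel:
    """Hypothetical stand-in for a Transformer conditioned on the full
    interaction history. A real model would encode `history` with attention;
    here both heads return random values so the sketch is runnable."""

    def __init__(self, num_actions, state_dim, seed=0):
        self.num_actions = num_actions
        self.state_dim = state_dim
        self.rng = np.random.default_rng(seed)

    def action_logits(self, history, state):
        # Distilled policy head: preference over discrete actions.
        return self.rng.normal(size=self.num_actions)

    def predict_dynamics(self, history, state, action):
        # Dynamics head: predicted next state and reward for a candidate action.
        next_state = state + self.rng.normal(scale=0.1, size=self.state_dim)
        reward = float(self.rng.normal())
        return next_state, reward


def plan_action(model, history, state, horizon=3):
    """Simulate a short rollout for each candidate first action using the
    model's own dynamics predictions, then return the first action of the
    rollout with the highest simulated return."""
    best_action, best_return = None, -np.inf
    for first_action in range(model.num_actions):
        total, s, a = 0.0, state, first_action
        for _ in range(horizon):
            s, r = model.predict_dynamics(history, s, a)
            total += r
            # Continue the imagined rollout greedily with the policy head.
            a = int(np.argmax(model.action_logits(history, s)))
        if total > best_return:
            best_action, best_return = first_action, total
    return best_action


if __name__ == "__main__":
    model = InContextModel(num_actions=4, state_dim=2)
    history = []          # (state, action, reward, next_state) tuples seen so far
    state = np.zeros(2)
    print("planned action:", plan_action(model, history, state))
```

The point of the sketch is the contrast with a purely model-free in-context learner, which would act directly from `action_logits`: the planning loop gives the agent a way to override an imitated-but-suboptimal action whenever the simulated outcomes favor a different one.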