Dual RL: Unification and New Methods for Reinforcement and Imitation Learning
Harshit Sikchi, Qinqing Zheng, Amy Zhang, Scott Niekum
–arXiv.org Artificial Intelligence
The goal of reinforcement learning (RL) is to maximize the expected cumulative return. It has been shown that this objective can be represented as an optimization problem over the state-action visitation distribution under linear constraints [52]. The dual problem of this formulation, which we refer to as dual RL, is unconstrained and easier to optimize. We show that several state-of-the-art off-policy deep RL algorithms, in both online and offline RL and imitation learning (IL) settings, can be viewed as dual RL approaches in a unified framework. This unification provides a common ground for studying and identifying the components that contribute to the success of these methods, and also reveals common shortcomings across methods, with new insights for improvement. Our analysis shows that prior off-policy imitation learning methods rely on an unrealistic coverage assumption and minimize a particular f-divergence between the visitation distributions of the learned policy and the expert policy. We propose a new method, based on a simple modification to the dual RL framework, that enables performant imitation learning from arbitrary off-policy data, attaining near-expert performance without learning a discriminator. Further, by framing the recent state-of-the-art offline RL method XQL [23] in the dual RL framework, we propose alternative choices that replace the Gumbel regression loss, achieve improved performance, and resolve the training instability of XQL. Project code and details can be found at hari-sikchi.github.io/dual-rl.
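As a rough sketch of the visitation-distribution view the abstract refers to (notation is ours, not necessarily the paper's: d_0 is the initial state distribution, d^O an off-policy/offline visitation, f^* the convex conjugate of the divergence generator f; the paper's exact regularized objective may differ), the primal problem can be written as

\[
\max_{d \ge 0} \; \sum_{s,a} d(s,a)\, r(s,a)
\quad \text{s.t.} \quad
\sum_{a} d(s,a) = (1-\gamma)\, d_0(s) + \gamma \sum_{s',a'} P(s \mid s',a')\, d(s',a') \quad \forall s,
\]

and adding an f-divergence regularizer \(D_f(d \,\|\, d^O)\) to this primal and taking the Lagrangian dual with multipliers \(V(s)\) yields an unconstrained problem of the form

\[
\min_{V} \; (1-\gamma)\, \mathbb{E}_{s \sim d_0}\big[V(s)\big]
\;+\; \mathbb{E}_{(s,a) \sim d^O}\Big[ f^*\big(r(s,a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s,a)}[V(s')] - V(s)\big) \Big],
\]

which is the kind of unconstrained "dual RL" objective the abstract describes; different choices of f recover different losses (e.g., alternatives to XQL's Gumbel regression loss).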
Jun-22-2023