Simple Ingredients for Offline Reinforcement Learning
Edoardo Cetin, Andrea Tirinzoni, Matteo Pirotta, Alessandro Lazaric, Yann Ollivier, Ahmed Touati
arXiv.org Artificial Intelligence
For instance, TD3+Behavior Cloning (TD3+BC, Fujimoto and Gu, 2021) achieves this by regularizing the actor loss with the divergence between the learned policy and the data-generating policy, while Advantage Weighted Actor Critic (AWAC, Nair et al., 2020) seeks a policy that maximizes the data likelihood weighted by its exponentiated advantage function. Later extensions of AWAC also modify the critic loss to avoid querying actions outside the given data by learning a state-value function, e.g., via expectile regression in Implicit Q-learning (IQL, Kostrikov et al., 2022) and Gumbel regression in Extreme Q-learning (XQL, Garg et al., 2023). This class of methods integrates easily with online fine-tuning and has led to several successful real-world applications (Lu et al., 2022; Nair et al., 2023). However, current offline RL methods still fail in simple settings. Hong et al. (2023b,c) showed that when the data contains many low-return and few high-return trajectories, policy-constrained methods are unnecessarily conservative and fail to learn good behavior. Singh et al. (2023) report a similar effect on heteroskedastic datasets, where the variability of behaviors differs across regions of the state space.
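To make the contrast between these constraint mechanisms concrete, here is a minimal PyTorch sketch of the three losses described above. The function names, default hyperparameters, and the weight-clipping constant in the AWAC sketch are illustrative assumptions, not code from the cited papers.

```python
import torch
import torch.nn.functional as F

def td3_bc_actor_loss(q_values, pi_actions, data_actions, alpha=2.5):
    """TD3+BC (Fujimoto and Gu, 2021): maximize Q while pulling the policy's
    actions toward the dataset actions with a behavior-cloning MSE term.
    The adaptive scale lambda = alpha / mean|Q| follows the paper."""
    lam = alpha / q_values.abs().mean().detach()
    return -(lam * q_values).mean() + F.mse_loss(pi_actions, data_actions)

def awac_actor_loss(log_probs, advantages, beta=1.0, max_weight=100.0):
    """AWAC (Nair et al., 2020): maximize the log-likelihood of dataset
    actions, weighted by the exponentiated advantage. The clipping constant
    is a common stabilization trick, assumed here rather than taken from
    the paper."""
    weights = torch.exp(advantages / beta).clamp(max=max_weight).detach()
    return -(weights * log_probs).mean()

def iql_value_loss(q_values, v_values, tau=0.7):
    """IQL (Kostrikov et al., 2022): fit V(s) to the tau-expectile of
    Q(s, a) over dataset actions via the asymmetric squared loss
    |tau - 1{u < 0}| * u^2, so the critic never queries actions
    outside the data."""
    u = q_values - v_values
    weight = torch.abs(tau - (u < 0).float())
    return (weight * u.pow(2)).mean()
```

The common thread in this sketch is that every loss only evaluates or regresses toward actions that appear in the dataset; this is precisely the conservatism that, per the cited results of Hong et al. (2023b,c) and Singh et al. (2023), can backfire when high-return behavior is rare or unevenly distributed in the data.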
Mar-19-2024