Bellman-consistent Pessimism for Offline Reinforcement Learning

Tengyang Xie, Ching-An Cheng, Nan Jiang, Paul Mineiro, Alekh Agarwal

arXiv.org Machine Learning 

Using past experiences to learn improved behavior for future interactions is a critical capability for a Reinforcement Learning (RL) agent. However, robustly extrapolating knowledge from a historical dataset for sequential decision making is highly challenging, particularly in settings where function approximation is employed to generalize across related observations. In this paper, we provide a systematic treatment of such scenarios with general function approximation, and devise algorithms that provably leverage an arbitrary historical dataset to discover the policy obtaining the largest guaranteed reward among all scenarios consistent with the dataset.

The problem of learning a good policy from historical datasets, typically called batch or offline RL, has a long history [see, e.g., Precup et al., 2000; Antos et al., 2008; Levine et al., 2020, and references therein]. Many prior works [e.g., Precup et al., 2000; Antos et al., 2008; Chen and Jiang, 2019] make so-called coverage assumptions on the dataset, requiring it to contain every possible state-action pair or trajectory with lower-bounded probability. Such assumptions are prohibitive in practice, particularly for problems with large state and/or action spaces. Furthermore, the methods developed under these assumptions routinely display unstable behaviors, such as lack of convergence or error amplification, when coverage is violated [Wang et al., 2020, 2021]. Driven by these instabilities, a growing body of recent literature has instead pursued a so-called best-effort style of guarantee.
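Concretely, the "largest guaranteed reward" criterion described in the abstract can be read as a max-min program over a value-function class. The following is a minimal sketch of that formulation; the function class $\mathcal{F}$, policy class $\Pi$, threshold $\varepsilon$, empirical Bellman-error functional $\mathcal{E}$, and initial-state distribution $d_0$ are notation assumed here rather than defined in this excerpt:

$$
\hat{\pi} \in \arg\max_{\pi \in \Pi} \;\; \min_{f \in \mathcal{F} \,:\, \mathcal{E}(f;\, \pi,\, \mathcal{D}) \le \varepsilon} \; \mathbb{E}_{s \sim d_0}\big[ f(s, \pi(s)) \big],
$$

where $\mathcal{D}$ is the historical dataset and the inner minimization ranges over value functions that are approximately Bellman-consistent with $\mathcal{D}$; the learned policy thus maximizes its worst-case value among the scenarios the data cannot rule out.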
