The Role of Coverage in Online Reinforcement Learning

Xie, Tengyang, Foster, Dylan J., Bai, Yu, Jiang, Nan, Kakade, Sham M.

arXiv.org Artificial Intelligence 

The last decade has seen development of reinforcement learning algorithms with strong empirical performance in domains including robotics (Kober et al., 2013; Lillicrap et al., 2015), dialogue systems (Li et al., 2016), and personalization (Agarwal et al., 2016; Tewari and Murphy, 2017). While there is great interest in applying these techniques to real-world decision making applications, the number of samples (steps of interaction) required to do so is often prohibitive, with state-of-the-art algorithms requiring millions of samples to reach human-level performance in challenging domains. Developing algorithms with improved sample efficiency, which entails efficiently generalizing across high-dimensional states and actions while taking advantage of problem structure as modeled practitioners, remains a major challenge. Investigation into design and analysis of algorithms for sample-efficient reinforcement learning has largely focused on two distinct problem formulations: Online reinforcement learning, where the learner can repeatedly interact with the environment by executing a policy and observing the resulting trajectory. Offline reinforcement learning, where the learner has access to logged transitions ands reward gathered from a fixed behavioral policy (e.g., historical data or expert demonstrations), but cannot directly interact with the underlying environment. While these formulations share a common goal (learning a near-optimal policy), the algorithms used to achieve this goal and conditions under which it can be achieved are seemingly quite different.

Duplicate Docs Excel Report

Title
None found

Similar Docs  Excel Report  more

TitleSimilaritySource
None found