Towards Instance-Optimal Offline Reinforcement Learning with Pessimism

Neural Information Processing Systems 

We study the offline reinforcement learning (offline RL) problem, where the goal is to learn a reward-maximizing policy in an unknown Markov Decision Process (MDP) using data collected by a policy µ. In particular, we consider the sample complexity of offline RL for finite-horizon MDPs. Prior works study this problem under different data-coverage assumptions, and their learning guarantees are expressed through covering coefficients that lack an explicit characterization in terms of system quantities.
