Towards Instance-Optimal Offline Reinforcement Learning with Pessimism
Neural Information Processing Systems
We study the offline reinforcement learning (offline RL) problem, where the goal is to learn a reward-maximizing policy in an unknown Markov Decision Process (MDP) using data collected by a behavior policy µ. In particular, we consider the sample complexity of offline RL for finite-horizon MDPs. Prior works study this problem under different data-coverage assumptions, and their learning guarantees are expressed through coverage coefficients that lack an explicit characterization in terms of system quantities.
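To make the setting concrete, the sketch below illustrates the general idea of pessimism in tabular, finite-horizon offline RL: estimate rewards and transitions from the logged data, then run backward value iteration while *subtracting* an uncertainty penalty, so that rarely observed state-action pairs are valued conservatively. This is a generic lower-confidence-bound illustration, not the algorithm or bound from this paper. The function name `pessimistic_value_iteration`, the penalty form `c * (H - h) / sqrt(n)`, the constant `c`, and the assumption that rewards lie in [0, 1] are all hypothetical choices made for illustration.

```python
import numpy as np


def pessimistic_value_iteration(data, S, A, H, c=1.0):
    """Hypothetical sketch of tabular pessimistic (LCB-style) value iteration.

    data: iterable of (h, s, a, r, s_next) tuples logged by a behavior policy mu.
    Assumes rewards in [0, 1], so values at step h lie in [0, H - h].
    Returns a greedy policy pi[h, s] and pessimistic values V[h, s].
    """
    counts = np.zeros((H, S, A))
    reward_sum = np.zeros((H, S, A))
    trans_counts = np.zeros((H, S, A, S))
    for h, s, a, r, s_next in data:
        counts[h, s, a] += 1
        reward_sum[h, s, a] += r
        trans_counts[h, s, a, s_next] += 1

    V = np.zeros((H + 1, S))  # V[H] = 0: value after the final step
    pi = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        n = np.maximum(counts[h], 1)  # avoid division by zero for unseen pairs
        r_hat = reward_sum[h] / n  # empirical mean reward, shape (S, A)
        P_hat = trans_counts[h] / n[..., None]  # empirical transitions, (S, A, S)
        # Assumed Hoeffding-style penalty; subtracting it is what makes this pessimistic.
        penalty = c * (H - h) / np.sqrt(n)
        Q = r_hat + P_hat @ V[h + 1] - penalty
        Q = np.where(counts[h] > 0, Q, 0.0)  # never-observed pairs get the minimum value
        Q = np.clip(Q, 0.0, H - h)
        pi[h] = Q.argmax(axis=1)
        V[h] = Q.max(axis=1)
    return pi, V[:H]


if __name__ == "__main__":
    # Toy data: at step 0 in state 0, action 0 is logged 100 times with reward 0.8,
    # while action 1 is logged once with a lucky reward of 1.0.
    data = ([(0, 0, 0, 0.8, 0)] * 100
            + [(0, 0, 1, 1.0, 0)]
            + [(1, 0, 0, 1.0, 0)] * 100)
    pi, V = pessimistic_value_iteration(data, S=2, A=2, H=2, c=1.0)
    pi_greedy, _ = pessimistic_value_iteration(data, S=2, A=2, H=2, c=0.0)
    print("pessimistic action:", pi[0, 0])    # well-covered action 0
    print("plain greedy action:", pi_greedy[0, 0])  # poorly covered action 1
```

With `c=0.0` the procedure reduces to plain certainty-equivalent planning and picks the action seen only once, because its single sample happened to be high; with the penalty it prefers the well-covered action. This is the intuition behind why guarantees for pessimistic methods depend on how well µ covers the relevant state-action pairs.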
Apr-25-2026, 01:53:36 GMT