Towards Instance-Optimal Offline Reinforcement Learning with Pessimism

Oct-9-2024, 18:05:40 GMT–Neural Information Processing Systems

We study the offline reinforcement learning (offline RL) problem, where the goal is to learn a reward-maximizing policy in an unknown Markov Decision Process (MDP) using data collected by a behavior policy \mu. In particular, we consider the sample complexity of offline RL for finite-horizon MDPs. Prior works derive information-theoretic lower bounds under different data-coverage assumptions, and their upper bounds are expressed through covering coefficients that lack an explicit characterization in terms of system quantities. Here \pi^\star is an optimal policy, \mu is the behavior policy, and d(s_h,a_h) is the marginal state-action probability. We call this adaptive bound the intrinsic offline reinforcement learning bound since it directly implies all the existing optimal results: the minimax rate under the uniform data-coverage assumption, the horizon-free setting, single-policy concentrability, and the tight problem-dependent results.
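As a rough illustration of the marginal state-action probability d(s_h,a_h) mentioned in the abstract, the sketch below propagates an initial state distribution forward through a small tabular finite-horizon MDP under a fixed policy. It is only a minimal sketch: the function name `marginal_occupancy`, the nested-list layout of `P` and `policy`, and the two-state toy MDP are all assumptions made for this example and are not taken from the paper.

```python
def marginal_occupancy(P, policy, init, H):
    """Return d with d[h][s][a] = Pr(s_h = s, a_h = a) for h = 0..H-1.

    Hypothetical layout (an assumption of this sketch):
      P[h][s][a][s2]   transition probability at step h
      policy[h][s][a]  probability of taking action a in state s at step h
      init[s]          initial state distribution
    """
    S = len(init)
    A = len(policy[0][0])
    state_dist = list(init)
    d = []
    for h in range(H):
        # Joint probability of (state, action) at step h.
        d_h = [[state_dist[s] * policy[h][s][a] for a in range(A)]
               for s in range(S)]
        d.append(d_h)
        # Push the joint distribution through the transition kernel.
        nxt = [0.0] * S
        for s in range(S):
            for a in range(A):
                for s2 in range(S):
                    nxt[s2] += d_h[s][a] * P[h][s][a][s2]
        state_dist = nxt
    return d


# Toy MDP: 2 states, 2 actions, horizon 2.
# Action 0 stays in the current state, action 1 switches state.
H = 2
step_kernel = [
    [[1.0, 0.0], [0.0, 1.0]],  # from state 0: a=0 stays, a=1 switches
    [[0.0, 1.0], [1.0, 0.0]],  # from state 1: a=0 stays, a=1 switches
]
P = [step_kernel for _ in range(H)]
uniform = [[0.5, 0.5], [0.5, 0.5]]
mu = [uniform for _ in range(H)]  # behavior policy, uniform over actions
init = [1.0, 0.0]                 # always start in state 0

d_mu = marginal_occupancy(P, mu, init, H)
```

In this toy setting the uniform behavior policy reaches every state-action pair by the second step, which is the kind of coverage that data-coverage assumptions in offline RL are meant to capture.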

instance-optimal offline reinforcement learning, offline RL

Technology:
- Information Technology > Artificial Intelligence > Machine Learning
  - Reinforcement Learning (0.86)
  - Learning Graphical Models > Undirected Networks
    - Markov Models (0.60)