Near-Optimal Offline Reinforcement Learning via Double Variance Reduction

Oct-10-2024, 03:20:39 GMT–Neural Information Processing Systems

We consider the problem of offline reinforcement learning (RL) --- a well-motivated setting of RL that aims at policy optimization using only historical data. Despite its wide applicability, theoretical understandings of offline RL, such as its optimal sample complexity, remain largely open even in basic settings such as \emph{tabular} Markov Decision Processes (MDPs). In this paper, we propose \emph{Off-Policy Double Variance Reduction} (OPDVR), a new variance reduction-based algorithm for offline RL. Our main result shows that OPDVR provably identifies an \epsilon -optimal policy with \widetilde{O}(H 2/d_m\epsilon 2) episodes of offline data in the finite-horizon \emph{stationary transition} setting, where H is the horizon length and d_m is the minimal marginal state-action distribution induced by the behavior policy. This improves over the best-known upper bound by a factor of H .

double variance reduction, near-optimal offline reinforcement learning, sample complexity, (4 more...)

Neural Information Processing Systems

Oct-10-2024, 03:20:39 GMT

Conferences Web Page

Add feedback

Technology:
- Information Technology > Artificial Intelligence > Machine Learning
  - Reinforcement Learning (0.65)
  - Learning Graphical Models > Undirected Networks
    - Markov Models (0.43)