Optimal Uniform OPE and Model-based Offline Reinforcement Learning in Time-Homogeneous, Reward-Free and Task-Agnostic Settings
– Neural Information Processing Systems
This work studies the statistical limits of uniform convergence for offline policy evaluation (OPE) problems with model-based methods (for episodic MDPs) and provides a unified framework toward optimal learning for several well-motivated offline tasks. Uniform OPE, $\sup_{\pi \in \Pi} |Q^{\pi} - \hat{Q}^{\pi}| \le \epsilon$, is a stronger measure than point-wise OPE and ensures offline learning when $\Pi$ contains all policies (the global class). In this paper, we establish an $\Omega(H^2 S / d_m \epsilon^2)$ lower bound (over the model-based family) for global uniform OPE, and our main result establishes an upper bound of $\tilde{O}(H^2 / d_m \epsilon^2)$ for \emph{local} uniform convergence that applies to all \emph{near-empirically optimal} policies for MDPs with \emph{stationary} transitions. Here $d_m$ is the minimal marginal state-action probability. Critically, the key to achieving the optimal rate $\tilde{O}(H^2 / d_m \epsilon^2)$ is our design of the \emph{singleton absorbing MDP}, a new sharp analysis tool that works with the model-based approach.
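A minimal sketch of the two guarantees above, assuming standard OPE notation; the empirical value $\hat{v}^{\pi}$ and the slack parameter $\epsilon_{\mathrm{opt}}$ are illustrative placeholders not defined in this summary, and the precise policy classes are as specified in the paper:

\[
\text{uniform OPE over } \Pi: \quad \sup_{\pi \in \Pi} \bigl| Q^{\pi} - \hat{Q}^{\pi} \bigr| \le \epsilon,
\qquad
\Pi_{\mathrm{local}} := \bigl\{ \pi : \hat{v}^{\pi} \ge \sup_{\pi'} \hat{v}^{\pi'} - \epsilon_{\mathrm{opt}} \bigr\}.
\]

Global uniform OPE takes $\Pi$ to be all policies and incurs sample complexity at least $\Omega(H^2 S / d_m \epsilon^2)$, whereas restricting $\Pi$ to the near-empirically-optimal class $\Pi_{\mathrm{local}}$ admits the $\tilde{O}(H^2 / d_m \epsilon^2)$ upper bound.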