Nearly Horizon-Free Offline Reinforcement Learning
–Neural Information Processing Systems
We revisit offline reinforcement learning on episodic time-homogeneous Markov Decision Processes (MDP). For tabular MDP with S states and A actions, or linear MDP with anchor points and feature dimension d, given the collected K episodes data with minimum visiting probability of (anchor) state-action pairs d_m, we obtain nearly horizon H -free sample complexity bounds for offline reinforcement learning when the total reward is upper bounded by 1. Specifically:• For offline policy evaluation, we obtain an \tilde{O}\left(\sqrt{\frac{1}{Kd_m}} \right) error bound for the plug-in estimator, which matches the lower bound up to logarithmic factors and does not have additional dependency on \mathrm{poly}(H, S, A, d) in higher-order term.• For offline policy optimization, we obtain an \tilde{O}\left(\sqrt{\frac{1}{Kd_m}} \frac{\min(S, d)}{Kd_m}\right) sub-optimality gap for the empirical optimal policy, which approaches the lower bound up to logarithmic factors and a high-order term, improving upon the best known result by [Cui and Yang 2020] that has additional \mathrm{poly} (H, S, d) factors in the main term.To the best of our knowledge, these are the first set of nearly horizon-free bounds for episodic time-homogeneous offline tabular MDP and linear MDP with anchor points. Central to our analysis is a simple yet effective recursion based method to bound a "total variance" term in the offline scenarios, which could be of individual interest.
Neural Information Processing Systems
May-26-2025, 22:54:40 GMT
- Technology: