Finite-Sample Analysis of Off-Policy TD-Learning via Generalized Bellman Operators

Dec-24-2025, 19:06:43 GMT–Neural Information Processing Systems

In TD-learning, off-policy sampling is known to be more practical than on-policy sampling, and by decoupling learning from data collection, it enables data reuse. It is known that policy evaluation has the interpretation of solving a generalized Bellman equation. In this paper, we derive finite-sample bounds for any general off-policy TD-like stochastic approximation algorithm that solves for the fixed-point of this generalized Bellman operator. Our key step is to show that the generalized Bellman operator is simultaneously a contraction mapping with respect to a weighted $\ell_p$-norm for each $p$ in $[1,\infty)$, with a common contraction factor. Off-policy TD-learning is known to suffer from high variance due to the product of importance sampling ratios.

finite-sample analysis, generalized bellman operator, off-policy td-learning, (6 more...)

Neural Information Processing Systems

Dec-24-2025, 19:06:43 GMT

Conferences Web Page

Add feedback

Technology:
- Information Technology > Artificial Intelligence > Machine Learning (0.65)