Finite-Sample Analysis of Off-Policy TD-Learning via Generalized Bellman Operators
–Neural Information Processing Systems
It is known that policy evaluation has the interpretation of solving a generalized Bellman equation. In this paper, we derive finite-sample bounds for any general off-policy TD-like stochastic approximation algorithm that solves for the fixed point of this generalized Bellman operator.
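For readers unfamiliar with the setting, one simple instance of an off-policy TD-like stochastic approximation is tabular TD(0) with per-step importance sampling. The sketch below is an illustration only, not the paper's algorithm or its generalized Bellman operator: the function names, the two-state toy MDP, the constant step size, and the discount factor are all assumptions made for this example. It samples actions from a behavior policy `mu`, reweights each update by `pi/mu`, and compares the result with the fixed point of the ordinary Bellman operator for the target policy `pi`.

```python
import random

# Hypothetical toy MDP (an assumption for illustration): 2 states, 2 actions.
# P[s][a] is the next-state distribution, R[s][a] the reward.
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.7, 0.3], [0.1, 0.9]]]
R = [[0.0, 1.0],
     [0.5, 1.0]]
pi = [[0.2, 0.8], [0.2, 0.8]]   # target policy to evaluate
mu = [[0.5, 0.5], [0.5, 0.5]]   # behavior policy that generates the data


def off_policy_td0(P, R, pi, mu, gamma=0.5, alpha=0.002, steps=100_000, seed=0):
    """Tabular off-policy TD(0) with per-step importance sampling ratios."""
    rng = random.Random(seed)
    n = len(P)
    V = [0.0] * n
    s = 0
    for _ in range(steps):
        # Act with the behavior policy, then sample the next state.
        a = rng.choices(range(len(mu[s])), weights=mu[s])[0]
        s_next = rng.choices(range(n), weights=P[s][a])[0]
        # The importance ratio corrects for sampling from mu instead of pi.
        rho = pi[s][a] / mu[s][a]
        td_error = R[s][a] + gamma * V[s_next] - V[s]
        V[s] += alpha * rho * td_error
        s = s_next
    return V


def exact_value(P, R, pi, gamma=0.5, iters=200):
    """Fixed point of the standard Bellman operator for pi, by repeated application."""
    n = len(P)
    V = [0.0] * n
    for _ in range(iters):
        V = [
            sum(
                pi[s][a] * (R[s][a] + gamma * sum(P[s][a][t] * V[t] for t in range(n)))
                for a in range(len(pi[s]))
            )
            for s in range(n)
        ]
    return V


if __name__ == "__main__":
    print("TD estimate:", off_policy_td0(P, R, pi, mu))
    print("Fixed point:", exact_value(P, R, pi))
```

With a constant step size the estimate does not settle exactly; it fluctuates in a neighborhood of the fixed point, and finite-sample bounds of the kind the abstract mentions are what quantify that behavior.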