Off-Policy Interval Estimation with Lipschitz Value Iteration
–Neural Information Processing Systems
Reinforcement learning (RL) (e.g., Sutton & Barto, 1998) has become widely used in tasks like ... Li, 2016; Liu et al., 2018a), estimating the expected reward of a target policy using observational data gathered from previous behavior policies, therefore holds tremendous promise for designing ... Our method is efficient and provably convergent. Our work is closely related to off-policy point estimation.