Off-Policy Interval Estimation with Lipschitz Value Iteration

Neural Information Processing Systems 

Reinforcement learning (RL) (e.g., Sutton & Barto, 1998) has become widely used in tasks like [...] ([...] Li, 2016; Liu et al., 2018a). Estimating the expected reward of a target policy using observational data gathered from previous behavior policies therefore holds tremendous promise for designing [...] Our method is efficient and provably convergent. [...] Our work is closely related to off-policy point estimation.
