Goto

Collaborating Authors

International Conference on Machine Learning





A Maximum-Entropy Approach to Off-Policy Evaluation in Average-Reward MDPs

Neural Information Processing Systems

Similarly, the rewards are linear in the features: r(s, a) = φ(s, a)ᵀw. Assumption A3 (Feature excitation): for a policy π with stationary distribution d_π(s, a), define Σ_π = E_{(s,a)∼d_π}[φ(s, a) φ(s, a)ᵀ].
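The feature-excitation assumption amounts to requiring that the second-moment matrix Σ_π be positive definite. A minimal numpy sketch, assuming synthetic Gaussian feature vectors standing in for φ(s, a) drawn from d_π (the dimensions, seed, and weight vector `w` are illustrative, not from the paper):

```python
import numpy as np

# Hypothetical synthetic setup: feature vectors φ(s, a) ∈ R^d sampled
# from a stationary distribution d_π; here we simply draw Gaussians.
rng = np.random.default_rng(0)
d, n = 4, 1000
phi = rng.normal(size=(n, d))      # rows are feature vectors φ(s_i, a_i)

# Linear reward model r(s, a) = φ(s, a)ᵀ w for some weight vector w.
w = rng.normal(size=d)
r = phi @ w

# Empirical second-moment matrix Σ_π = E_{(s,a)∼d_π}[φ(s, a) φ(s, a)ᵀ].
Sigma = phi.T @ phi / n

# Feature excitation holds when Σ_π is positive definite, i.e. its
# smallest eigenvalue is strictly positive.
lam_min = np.linalg.eigvalsh(Sigma).min()
```

With independent Gaussian features and n much larger than d, the empirical Σ_π is full rank, so the smallest eigenvalue comes out strictly positive.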




Breaking the Curse of Horizon: Infinite-Horizon Off-Policy Estimation

Qiang Liu, Lihong Li, Ziyang Tang, Dengyong Zhou

Neural Information Processing Systems

The experiments compare with step-wise IS/WIS estimates on T-step trajectories, using a neural-network model and reporting the median over runs.
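The core idea behind breaking the curse of horizon is to importance-weight on the stationary state distribution instead of on whole trajectories. A minimal sketch of a self-normalized estimator of that form; `state_ratio` (d_π(s)/d_π0(s)) and `action_ratio` (π(a|s)/π0(a|s)) are assumed given here, whereas the paper's contribution is estimating the state ratio from data:

```python
import numpy as np

def stationary_ratio_estimate(rewards, state_ratio, action_ratio):
    """Self-normalized off-policy reward estimate.

    rewards      : r_i observed at samples (s_i, a_i) from the behavior policy
    state_ratio  : d_target(s_i) / d_behavior(s_i)  (assumed known here)
    action_ratio : pi_target(a_i|s_i) / pi_behavior(a_i|s_i)
    """
    w = state_ratio * action_ratio          # per-sample importance weight
    return np.sum(w * rewards) / np.sum(w)  # self-normalization

# On-policy sanity check: all ratios equal 1, so the estimate is the mean.
est = stationary_ratio_estimate(
    np.array([1.0, 2.0, 3.0]), np.ones(3), np.ones(3)
)
```

Because the weights depend only on per-sample state-action ratios, their variance does not compound multiplicatively over the horizon the way step-wise trajectory IS/WIS weights do.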