Reviews: Safe and Efficient Off-Policy Reinforcement Learning
Neural Information Processing Systems
In particular, it bounds the performance of off-policy importance sampling as a function of a truncation coefficient, and discusses how to choose that coefficient based on the bound it proposes. The lack of any discussion of the relationship to that work makes paper 602 considerably weaker in my opinion. I would still lean towards acceptance, but only as a poster. Analyzing the convergence of the general-form off-policy updates in Equation 4 is novel and important. The theory is limited to finite state spaces in discounted MDPs (a limitation that should be stated in the abstract), but the empirical results show that the new Retrace algorithm can perform well in conjunction with value function approximation.
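To make the truncation idea concrete: Retrace(λ) truncates the per-step importance ratio at 1, using trace coefficients c_s = λ min(1, π(a_s|x_s)/μ(a_s|x_s)). Below is a minimal tabular sketch of one such update for a finite MDP, in the reviewer's finite-state setting. The function name, trajectory format, and step size are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def retrace_update(q, states, actions, rewards, pi, mu,
                   gamma=0.99, lam=1.0, alpha=0.1):
    """Sketch of one Retrace(lambda) update along a sampled trajectory.

    q:       tabular action values, shape (n_states, n_actions)
    states:  visited states, length T + 1 (last entry is the bootstrap state)
    actions, rewards: length-T trajectory generated by behaviour policy mu
    pi, mu:  target / behaviour policy probabilities, shape (n_states, n_actions)
    """
    T = len(rewards)
    # Truncated importance ratios: c_s = lam * min(1, pi/mu).
    # Cutting the ratio at 1 is what keeps the variance of the product bounded.
    c = np.array([lam * min(1.0, pi[states[s], actions[s]] / mu[states[s], actions[s]])
                  for s in range(T)])
    trace = 1.0       # running product gamma^t * c_1 ... c_t
    correction = 0.0  # accumulated off-policy-corrected TD errors
    for t in range(T):
        if t > 0:
            trace *= gamma * c[t]
        # TD error bootstrapping on the target policy's expected value
        # (assumes the final state in `states` is non-terminal).
        exp_q_next = pi[states[t + 1]] @ q[states[t + 1]]
        delta = rewards[t] + gamma * exp_q_next - q[states[t], actions[t]]
        correction += trace * delta
    q[states[0], actions[0]] += alpha * correction
    return q
```

Because each c_s ≤ 1, the trace product can only shrink, so the update stays safe however far the behaviour policy μ is from the target policy π, while still using full returns when the two agree.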