Appendix A Checklist

Neural Information Processing Systems

1. For all authors: (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
2. Did you include complete proofs of all theoretical results?
3. Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] Code available.
4. Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)?
5. Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)?
6. Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)?
7. If your work uses existing assets, did you cite the creators?
8. Did you include any new assets either in the supplemental material or as a URL? [Yes]
9. Did you discuss whether and how consent was obtained from people whose data you're using/curating?
10. If you used crowdsourcing or conducted research with human subjects...

(a) The proof is similar to the proof of Lemma 1 in Achiam et al.

In Section 7, we demonstrated the benefits of our algorithm with uniform policy weights, i.e., equal weight on each prior policy. Non-uniform policy weights introduce an additional source of variance, so to account for this we must extend the notion of sample size to effective sample size.
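To make the last point concrete, here is a minimal sketch of the standard effective-sample-size formula for a weighted sample. This is a generic illustration, not the paper's implementation; the function name and the example weights are hypothetical.

import numpy as np

def effective_sample_size(weights):
    """ESS = (sum_i w_i)^2 / sum_i w_i^2 for nonnegative weights.

    Equals the nominal sample size n for uniform weights and shrinks
    as the weights become more uneven, reflecting the extra variance
    that non-uniform weighting introduces.
    """
    w = np.asarray(weights, dtype=float)
    return float(np.sum(w) ** 2 / np.sum(w ** 2))

print(effective_sample_size([0.25, 0.25, 0.25, 0.25]))  # 4.0 (uniform)
print(effective_sample_size([0.70, 0.10, 0.10, 0.10]))  # ~1.92 (uneven)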



Generalized Proximal Policy Optimization with Sample Reuse

Queeney, James, Paschalidis, Ioannis Ch., Cassandras, Christos G.

arXiv.org Artificial Intelligence

In real-world decision making tasks, it is critical for data-driven reinforcement learning methods to be both stable and sample efficient. On-policy methods typically generate reliable policy improvement throughout training, while off-policy methods make more efficient use of data through sample reuse. In this work, we combine the theoretically supported stability benefits of on-policy algorithms with the sample efficiency of off-policy algorithms. We develop policy improvement guarantees that are suitable for the off-policy setting, and connect these bounds to the clipping mechanism used in Proximal Policy Optimization. This motivates an off-policy version of the popular algorithm that we call Generalized Proximal Policy Optimization with Sample Reuse. We demonstrate both theoretically and empirically that our algorithm delivers improved performance by effectively balancing the competing goals of stability and sample efficiency.
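As a rough illustration of the clipping mechanism described above, the sketch below shows one natural way to extend PPO's clipped surrogate to samples collected by older policies: the probability ratio is taken with respect to the behavior policy that generated each sample, and the clipping region is recentered at the current policy's ratio so updates stay close to the most recent policy. This is a hedged reading of the idea, not the authors' code; the names (geppo_style_surrogate, logp_behavior, etc.) and the exact clip placement are illustrative assumptions.

import numpy as np

def geppo_style_surrogate(logp_new, logp_behavior, logp_current, adv, eps=0.2):
    """Clipped surrogate loss over samples from prior policies (sketch).

    logp_new:      log pi_theta(a|s), the policy being optimized
    logp_behavior: log pi_{k-i}(a|s), the older policy that collected the sample
    logp_current:  log pi_k(a|s), the most recent policy
    adv:           advantage estimates for the sampled (s, a) pairs
    """
    ratio = np.exp(logp_new - logp_behavior)       # pi_theta / pi_{k-i}
    center = np.exp(logp_current - logp_behavior)  # pi_k / pi_{k-i}
    clipped = np.clip(ratio, center - eps, center + eps)
    # Pessimistic min, as in PPO: ignore ratio changes that would
    # push the objective beyond the clipped region.
    return np.mean(np.minimum(ratio * adv, clipped * adv))

When every sample comes from the current policy, center equals 1 and this reduces to the familiar PPO clipped surrogate.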