Transductive Off-policy Proximal Policy Optimization
Yaozhong Gan, Renye Yan, Xiaoyang Tan, Zhe Wu, Junliang Xing
arXiv.org Artificial Intelligence
Proximal Policy Optimization (PPO) is a popular model-free reinforcement learning algorithm, valued for its simplicity and efficacy. However, its inherently on-policy nature limits its ability to exploit data collected under other policies. This paper introduces a novel off-policy extension of the original PPO method, named Transductive Off-policy PPO (ToPPO). We provide theoretical justification for incorporating off-policy data into PPO training, together with prudent guidelines for doing so safely. Our contribution includes a novel formulation of the policy improvement lower bound for prospective policies derived from off-policy data, accompanied by a computationally efficient mechanism to optimize this bound, underpinned by guarantees of monotonic improvement. Comprehensive experimental results across six representative tasks underscore ToPPO's promising performance.
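The abstract names but does not specify ToPPO's lower bound or its optimization mechanism, so the following is only a minimal sketch of the underlying idea: the standard PPO clipped surrogate, generalized so the importance ratio is taken with respect to an arbitrary behavior policy mu that generated the off-policy data, rather than the current policy. The function name `clipped_surrogate` and all variable names are hypothetical illustrations, not taken from the paper.

```python
# Hypothetical sketch: a PPO-style clipped surrogate where the data may come
# from a behavior policy mu different from the policy being optimized.
# This is NOT the ToPPO objective; it only illustrates the off-policy ratio.
import numpy as np

def clipped_surrogate(logp_new, logp_behavior, advantages, clip_eps=0.2):
    """PPO-style clipped surrogate over a sampled batch.

    logp_new:      log pi_theta(a|s) under the policy being optimized
    logp_behavior: log mu(a|s) under the (possibly off-policy) behavior
                   policy that generated the data
    advantages:    advantage estimates computed from the same batch
    """
    ratio = np.exp(logp_new - logp_behavior)  # importance weight vs. mu
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Pessimistic choice between the two terms, as in PPO's lower-bound
    # surrogate; maximizing this discourages overly large policy updates.
    return np.mean(np.minimum(unclipped, clipped))

# Toy usage with random numbers standing in for a sampled batch.
rng = np.random.default_rng(0)
logp_new = rng.normal(-1.0, 0.1, size=256)
logp_behavior = rng.normal(-1.0, 0.1, size=256)
adv = rng.normal(0.0, 1.0, size=256)
print(clipped_surrogate(logp_new, logp_behavior, adv))
```

When mu equals the previous policy pi_old, this reduces to the ordinary PPO objective; the paper's contribution lies in making the off-policy case safe, with a principled lower bound and monotonic-improvement guarantees that this sketch does not capture.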
Jun-6-2024