Reparameterization Proximal Policy Optimization
Hai Zhong, Xun Wang, Zhuoran Li, Longbo Huang
arXiv.org Artificial Intelligence
Reparameterization policy gradient (RPG) is promising for improving sample efficiency by leveraging differentiable dynamics. However, a critical barrier is its training instability, where high-variance gradients can destabilize the learning process. To address this, we draw inspiration from Proximal Policy Optimization (PPO), which uses a surrogate objective to enable stable sample reuse in the model-free setting. Establishing a connection between this surrogate objective and RPG is non-trivial and has been largely unexplored; we bridge this gap by demonstrating that the reparameterization gradient of a PPO-like surrogate objective can be computed efficiently using backpropagation through time. Based on this key insight, we propose Reparameterization Proximal Policy Optimization (RPO), a stable and sample-efficient RPG-based method. RPO enables stable sample reuse over multiple epochs by employing a policy gradient clipping mechanism tailored for RPG. It is further stabilized by Kullback-Leibler (KL) divergence regularization and remains fully compatible with existing variance reduction methods. We evaluate RPO on a suite of challenging locomotion and manipulation tasks, where it achieves superior sample efficiency and strong performance.
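For context, the PPO surrogate the abstract refers to is the standard clipped objective min(r·A, clip(r, 1−ε, 1+ε)·A), where r is the probability ratio between the new and old policies and A is the advantage. The sketch below illustrates only this model-free clipping mechanism in NumPy; how RPO differentiates it through the dynamics via backpropagation through time is the paper's contribution and is not shown here. The function name and ε value are illustrative choices, not the paper's code.

```python
import numpy as np

def ppo_clip_surrogate(ratio, advantage, eps=0.2):
    """Standard PPO clipped surrogate (illustrative sketch).

    ratio:     pi_new(a|s) / pi_old(a|s), the importance ratio
    advantage: advantage estimate A for the sampled action
    eps:       clip range (0.2 is a common default, assumed here)
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # Taking the minimum caps the objective once the ratio leaves
    # [1 - eps, 1 + eps] in the direction that would increase it,
    # which removes the incentive for destructively large updates.
    return np.minimum(unclipped, clipped)
```

With a positive advantage, a ratio above 1 + ε contributes only the capped value (e.g. `ppo_clip_surrogate(1.5, 1.0)` yields 1.2 with ε = 0.2), so the gradient through the ratio vanishes there; this is the stabilizing behavior that RPO adapts to the reparameterization-gradient setting.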
Sep-26-2025
- Genre:
- Research Report (0.50)
- Technology:
- Information Technology > Artificial Intelligence
- Machine Learning > Neural Networks (0.66)
- Representation & Reasoning (1.00)
- Robots (1.00)