Simple Policy Optimization
arXiv.org Artificial Intelligence
The PPO (Proximal Policy Optimization) algorithm has demonstrated excellent performance in many fields and is considered a simplified version of the TRPO (Trust Region Policy Optimization) algorithm. However, the ratio clipping operation in PPO may not always effectively enforce the trust region constraint, which can be a potential factor affecting the stability of the algorithm. In this paper, we propose the SPO (Simple Policy Optimization) algorithm, which introduces a novel clipping method for the KL divergence between the old and current policies. SPO effectively enforces the trust region constraint in almost all environments while retaining the simplicity of a first-order algorithm. Comparative experiments in Atari 2600 environments show that SPO sometimes outperforms PPO. Code is available at https://github.com/MyRepositories-hub/Simple-Policy-Optimization.
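For context, the "ratio clipping" the abstract critiques is PPO's standard clipped surrogate objective. Below is a minimal NumPy sketch of that objective (it is not the paper's SPO method, whose exact KL-clipping form is not given here); the docstring notes why clipping the objective does not by itself bound the policy ratio or the KL divergence:

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, epsilon=0.2):
    """PPO's clipped surrogate: min(r * A, clip(r, 1-eps, 1+eps) * A).

    Clipping zeroes the gradient once the ratio r leaves
    [1 - eps, 1 + eps], but it does not constrain r (or the KL
    divergence between old and current policies) itself -- the
    potential stability issue the abstract points to.
    """
    r = np.asarray(ratio, dtype=float)
    a = np.asarray(advantage, dtype=float)
    clipped = np.clip(r, 1.0 - epsilon, 1.0 + epsilon)
    return np.minimum(r * a, clipped * a)
```

For example, with a positive advantage the objective saturates at `(1 + epsilon) * A` once the ratio exceeds `1 + epsilon`, so `ppo_clip_objective(1.5, 1.0)` evaluates to `1.2` rather than `1.5`.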
Jan-29-2024
- Genre:
- Research Report (0.50)
- Industry:
- Leisure & Entertainment > Games (0.69)
- Technology:
- Information Technology > Artificial Intelligence
- Machine Learning (1.00)
- Representation & Reasoning > Optimization (1.00)
- Robots (1.00)