Simple Policy Optimization
arXiv.org Artificial Intelligence
The PPO (Proximal Policy Optimization) algorithm has demonstrated excellent performance in many fields and is considered a simplified version of the TRPO (Trust Region Policy Optimization) algorithm. However, the ratio clipping operation in PPO may not always effectively enforce the trust region constraint, which can be a potential factor affecting the stability of the algorithm. In this paper, we propose the SPO (Simple Policy Optimization) algorithm, which introduces a novel clipping method for the KL divergence between the old and current policies. SPO effectively enforces the trust region constraint in almost all environments while retaining the simplicity of a first-order algorithm. Comparative experiments in Atari 2600 environments show that SPO sometimes outperforms PPO. Code is available at https://github.com/MyRepositories-hub/Simple-Policy-Optimization.
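For context, the "ratio clipping" the abstract critiques is PPO's standard clipped surrogate objective. Below is a minimal NumPy sketch of that objective (it is not the paper's SPO method, whose exact KL-clipping form is not given here); the docstring notes why clipping the objective does not by itself bound the policy ratio or the KL divergence:

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, epsilon=0.2):
    """PPO's clipped surrogate: min(r * A, clip(r, 1-eps, 1+eps) * A).

    Clipping zeroes the gradient once the ratio r leaves
    [1 - eps, 1 + eps], but it does not constrain r (or the KL
    divergence between old and current policies) itself -- the
    potential stability issue the abstract points to.
    """
    r = np.asarray(ratio, dtype=float)
    a = np.asarray(advantage, dtype=float)
    clipped = np.clip(r, 1.0 - epsilon, 1.0 + epsilon)
    return np.minimum(r * a, clipped * a)
```

For example, with a positive advantage the objective saturates at `(1 + epsilon) * A` once the ratio exceeds `1 + epsilon`, so `ppo_clip_objective(1.5, 1.0)` evaluates to `1.2` rather than `1.5`.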
Jan-29-2024
- Genre:
- Research Report (0.50)
- Industry:
- Leisure & Entertainment > Games (0.69)
- Technology:
- Information Technology > Artificial Intelligence
- Machine Learning (1.00)
- Representation & Reasoning > Optimization (1.00)
- Robots (1.00)