Neural PPO-Clip Attains Global Optimality: A Hinge Loss Perspective
Huang, Nai-Chieh, Hsieh, Ping-Chun, Ho, Kuo-Hao, Yao, Hsuan-Yu, Hu, Kai-Chun, Ouyang, Liang-Chun, Wu, I-Chen
arXiv.org Artificial Intelligence
Policy optimization is a fundamental principle for designing reinforcement learning algorithms. One example is the proximal policy optimization algorithm with a clipped surrogate objective (PPO-Clip), which is widely used in deep reinforcement learning because of its simplicity and effectiveness. Despite its strong empirical performance, PPO-Clip has not been theoretically justified to date. In this paper, we establish the first global convergence rate of PPO-Clip under neural function approximation. We identify the fundamental challenges of analyzing PPO-Clip and address them with two core ideas: (i) We reinterpret PPO-Clip from the perspective of hinge loss, which connects policy improvement with solving a large-margin classification problem and offers a generalized version of the PPO-Clip objective. (ii) Based on this viewpoint, we propose a two-step policy improvement scheme, which facilitates the convergence analysis by decoupling policy search from the complex neural policy parameterization with the help of entropic mirror descent and a regression-based policy update scheme. Moreover, our theoretical results provide the first characterization of the effect of the clipping mechanism on the convergence of PPO-Clip. Through experiments, we empirically validate the reinterpretation of PPO-Clip and the generalized objective with various classifiers on various RL benchmark tasks.
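The hinge loss reading in (i) can be illustrated with a small numerical check. The sketch below is not taken from the paper: the function names, the symbols r (probability ratio between the new and old policy), A (advantage), and eps (clipping range), and the sampling ranges are all assumptions made for illustration. It checks that, per sample, the standard clipped surrogate min(r*A, clip(r, 1-eps, 1+eps)*A) equals a term that does not depend on r minus a hinge loss with margin eps, weighted by |A|, so maximizing the former amounts to minimizing the latter.

```python
import random


def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def ppo_clip_objective(ratio, advantage, eps):
    """Per-sample PPO-Clip surrogate: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped * advantage)


def weighted_hinge_loss(ratio, advantage, eps):
    """|A| * max(0, eps - sign(A) * (r - 1)): hinge loss with margin eps.

    sign(A) plays the role of the class label and (r - 1) the role of
    the classifier output (notation assumed for this sketch).
    """
    return abs(advantage) * max(0.0, eps - sign(advantage) * (ratio - 1.0))


def check_identity(num_samples=1000, eps=0.2, seed=0):
    """Return the largest gap between the two sides of the identity.

    Identity: clip objective == A * (1 + sign(A) * eps) - weighted hinge,
    where the first term on the right does not depend on the ratio.
    """
    rng = random.Random(seed)
    max_gap = 0.0
    for _ in range(num_samples):
        r = rng.uniform(0.0, 3.0)   # hypothetical range of ratios
        a = rng.uniform(-2.0, 2.0)  # hypothetical range of advantages
        const = a * (1.0 + sign(a) * eps)
        rhs = const - weighted_hinge_loss(r, a, eps)
        max_gap = max(max_gap, abs(ppo_clip_objective(r, a, eps) - rhs))
    return max_gap


if __name__ == "__main__":
    print("largest gap:", check_identity())
```

Because the two expressions differ only by a term that is constant in r, they have the same gradient (up to sign) with respect to the policy parameters wherever r is differentiable. Samples that already satisfy the margin, for example r >= 1 + eps when A > 0, incur zero hinge loss, which matches the way clipping removes their contribution to the update.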
Aug-31-2022
- Country:
- North America > United States (0.04)
- Asia > Taiwan
- Taiwan Province > Taipei (0.04)
- Genre:
- Research Report (0.50)