Guaranteed Trust Region Optimization via Two-Phase KL Penalization

Zentner, K. R., Puri, Ujjwal, Huang, Zhehui, Sukhatme, Gaurav S.

arXiv.org Artificial Intelligence 

On-policy reinforcement learning (RL) methods seek to optimize a stochastic policy, where a neural network parameterizes a distribution π(a|s) over actions conditioned on the current state. In this framework, most on-policy RL methods limit the scale of updates between successive policies during optimization. Some on-policy RL methods operate by guaranteeing that each policy update remains within a "trust region" (Schulman et al., 2015a). These methods are used when stability over long periods of training is essential. However, finding a policy update near the edge of the trust region often comes at significant computational cost. Another branch of on-policy methods instead performs "proximal" policy updates, which limit the expected scale of policy updates but can result in individual policy updates of arbitrary magnitude (Schulman et al., 2017a). These methods are much more computationally efficient, but large-scale training can require multiple training runs or human intervention to recover from training instabilities. In this work we propose Fixup Policy Optimization (FixPO), which combines a proximal primary phase with a precise fixup phase, both of which share a single penalty coefficient β. By performing a more conservative proximal update before strictly enforcing a trust region, FixPO approximately matches the computational efficiency and rewards of proximal methods while providing the same stability guarantees as trust region methods.
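As a rough illustration of the two-phase structure described above, the following PyTorch sketch pairs a KL-penalized proximal update (primary phase) with a fixup loop that keeps minimizing only the KL penalty until the new policy lies inside the trust region. The policy architecture, the hyperparameters beta, epsilon_kl, primary_epochs, and max_fixup_steps, and the helper function names are illustrative assumptions for this sketch, not the paper's reference implementation (which, e.g., also adapts β during training).

    # Minimal sketch of a two-phase KL-penalized policy update.
    # All names and hyperparameter values here are illustrative assumptions.
    import torch
    import torch.nn as nn
    from torch.distributions import Categorical, kl_divergence

    # Toy categorical policy: 4-dimensional states, 2 discrete actions.
    policy = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))
    optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)

    def two_phase_update(states, actions, advantages, old_logits,
                         beta=1.0, epsilon_kl=0.01,
                         primary_epochs=10, max_fixup_steps=100):
        old_dist = Categorical(logits=old_logits)      # frozen pre-update policy
        old_log_prob = old_dist.log_prob(actions)

        def surrogate_and_kl():
            dist = Categorical(logits=policy(states))
            ratio = torch.exp(dist.log_prob(actions) - old_log_prob)
            surrogate = (ratio * advantages).mean()    # importance-weighted policy-gradient surrogate
            kl = kl_divergence(old_dist, dist).mean()  # mean KL from the old policy
            return surrogate, kl

        # Phase 1 (primary): conservative proximal update of the
        # KL-penalized surrogate objective.
        for _ in range(primary_epochs):
            surrogate, kl = surrogate_and_kl()
            optimizer.zero_grad()
            (-surrogate + beta * kl).backward()
            optimizer.step()

        # Phase 2 (fixup): minimize only the shared KL penalty until the
        # updated policy is strictly inside the trust region KL <= epsilon_kl.
        for _ in range(max_fixup_steps):
            _, kl = surrogate_and_kl()
            if kl.item() <= epsilon_kl:
                break
            optimizer.zero_grad()
            (beta * kl).backward()
            optimizer.step()

    # Example call on random placeholder data:
    states = torch.randn(64, 4)
    actions = torch.randint(0, 2, (64,))
    advantages = torch.randn(64)
    with torch.no_grad():
        old_logits = policy(states)
    two_phase_update(states, actions, advantages, old_logits)

The intended point of the sketch is the division of labor: the primary phase trades off reward improvement against the KL penalty, while the fixup phase spends a (typically small) number of extra gradient steps purely on the penalty term so that the final policy provably-in-practice satisfies the trust-region bound before the next batch of data is collected.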