Goto

Collaborating Authors

MuJoCo environment




60cb558c40e4f18479664069d9642d5a-AuthorFeedback.pdf

Neural Information Processing Systems

We thank all the reviewers for the time and expertise invested in these reviews. A: We are sorry that some abuse of notation in the paper hinders the understanding of our method. A: Such an assumption comes from an empirical observation that, in robotics control problems, some key poses under different dynamics are still alike.





Directional-Clamp PPO

Gilad Karpel, Ruida Zhou, Shoham Sabach, Mohammad Ghavamzadeh

arXiv.org Artificial Intelligence

Proximal Policy Optimization (PPO) is widely regarded as one of the most successful deep reinforcement learning algorithms, known for its robustness and effectiveness across a range of problems. The PPO objective encourages the importance ratio between the current and behavior policies to move in the "right" direction -- starting from importance sampling ratios equal to 1, increasing the ratios for actions with positive advantages and decreasing those with negative advantages. A clipping function is introduced to prevent over-optimization when updating the importance ratio in these "right" direction regions. Many PPO variants have been proposed to extend its success, most of which modify the objective's behavior by altering the clipping in the "right" direction regions. However, due to randomness in the rollouts and stochasticity of the policy optimization, we observe that the ratios frequently move in the "wrong" direction during PPO optimization. This is a key factor hindering the improvement of PPO, yet it has been largely overlooked. To address this, we propose the Directional-Clamp PPO algorithm (DClamp-PPO), which further penalizes actions moving into the strict "wrong" direction regions, where the advantage is positive (negative) and the importance ratio falls below (above) $1-\beta$ ($1+\beta$), for a tunable parameter $\beta \in (0, 1)$. The penalty is enforced by imposing a steeper loss slope, i.e., a clamp, in those regions. We demonstrate that DClamp-PPO consistently outperforms PPO, as well as its variants that focus on modifying the objective's behavior in the "right" direction regions, across various MuJoCo environments and random seeds. The proposed method is shown, both theoretically and empirically, to better avoid "wrong" direction updates while keeping the importance ratio closer to 1.
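To make the strict "wrong" direction penalty concrete, the following is a minimal PyTorch sketch of a directional-clamp surrogate loss. The slope multiplier `kappa` and the exact anchoring of the clamped branches are illustrative assumptions, not values from the paper; the precise penalty form may differ.

```python
import torch

def dclamp_ppo_loss(ratio, advantage, eps=0.2, beta=0.2, kappa=2.0):
    """Sketch of a directional-clamp PPO surrogate loss (to be minimized).

    ratio:     importance ratio pi_theta(a|s) / pi_behavior(a|s)
    advantage: estimated advantage A(s, a)
    eps:       standard PPO clipping parameter
    beta:      width of the strict "wrong" direction region
    kappa:     slope multiplier inside that region (illustrative assumption)
    """
    # Standard PPO clipped surrogate (maximized, hence negated at the end).
    surrogate = torch.min(
        ratio * advantage,
        torch.clamp(ratio, 1 - eps, 1 + eps) * advantage,
    )

    # Strict "wrong" direction regions: positive advantage with the ratio
    # below 1 - beta, or negative advantage with the ratio above 1 + beta.
    wrong_pos = (advantage > 0) & (ratio < 1 - beta)
    wrong_neg = (advantage < 0) & (ratio > 1 + beta)

    # Steeper loss slope (the "clamp") inside those regions: a linear branch
    # with slope scaled by kappa, anchored at 1 -/+ beta so the surrogate
    # stays continuous at the region boundaries.
    pos_branch = ((1 - beta) + kappa * (ratio - (1 - beta))) * advantage
    neg_branch = ((1 + beta) + kappa * (ratio - (1 + beta))) * advantage
    surrogate = torch.where(wrong_pos, pos_branch, surrogate)
    surrogate = torch.where(wrong_neg, neg_branch, surrogate)

    return -surrogate.mean()

# Usage: ratio = torch.exp(new_log_prob - old_log_prob)
#        loss = dclamp_ppo_loss(ratio, advantage_estimates)
```

Because each clamped branch equals the unclipped surrogate at the boundary $1 \mp \beta$, the only change inside the "wrong" direction regions is the steeper slope, which pushes the optimizer back toward ratios near 1.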


A Theoretical Derivations

Neural Information Processing Systems

A brief proof is provided as follows. Here, we describe certain implementation details of TEEN. For the recurrent optimization mentioned in Section 4.2, we set the period of … We provide the explicit parameters used in our algorithm in Table 1. To reproduce TD3, we use the official implementation (https://github.com/sfujim/TD3).

Table 1: Hyperparameters

Batch size                              256
Discount (γ)                            0.99
Number of hidden layers                 2
Number of hidden units per layer        256
Activation function                     ReLU
Iterations per time step                1
Target smoothing coefficient (η)        5 × 10⁻³
Variance of target policy smoothing     0.2
Noise clip range                        [−0.5, 0.5]
Target critic update interval           2

C Additional Experimental Results

The bolded line represents the average evaluation over 5 seeds.
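For concreteness, here is a minimal sketch collecting the Table 1 settings into a single configuration for a TD3-based run. The key names are illustrative assumptions, not the argument names of the official implementation.

```python
# Table 1 settings gathered into one place; key names are illustrative.
td3_config = {
    "batch_size": 256,
    "discount": 0.99,                 # gamma
    "num_hidden_layers": 2,
    "hidden_units_per_layer": 256,
    "activation": "ReLU",
    "iterations_per_time_step": 1,
    "target_smoothing_coef": 5e-3,    # eta
    "policy_noise": 0.2,              # variance of target policy smoothing
    "noise_clip": 0.5,                # target noise clipped to [-0.5, 0.5]
    "target_critic_update_interval": 2,
}
```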