PPO-BR: Dual-Signal Entropy-Reward Adaptation for Trust Region Policy Optimization

Rahman, Ben

arXiv.org Artificial Intelligence 

PPO-BR establishes a new paradigm in adaptive RL by fusing exploration and convergence signals into a single bounded trust region, a theoretically grounded innovation (Theorem 1) that outperforms 5 SOTA baselines with <2% overhead (Fig 3). This work bridges a critical gap in phase-aware learning and enables real-world deployment in safety-critical systems such as robotic surgery (Appendix E). The mechanism has two parts: (1) entropy-driven expansion of the trust region (ϵ) promotes exploration in high-uncertainty states, while (2) reward-guided contraction of the trust region (ϵ) enforces stability during convergence (Theorem 1). On 6 diverse benchmarks (MuJoCo/Atari/sparse-reward), PPO-BR achieves 29.1% faster convergence (p < 0.001, Wilcoxon test), 2.3× lower reward variance vs PPO (Fig 3), and <1.8% runtime overhead with just 5 lines of code change (Algorithm 1). PPO-BR's plug-and-play simplicity and theoretical guarantees (Lemma 2) make it ready to deploy in safety-critical systems, from surgical robotics to autonomous drones, where adaptive stability is non-negotiable. In contrast to recent methods such as Group Relative Policy Optimization (GRPO), PPO-BR offers a unified entropy-reward adaptive mechanism applicable to both language models and general reinforcement learning environments.
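The dual-signal idea can be illustrated with a minimal sketch. The abstract does not state the update rule, so everything below is an assumption made for illustration: ϵ is taken to play the role of PPO's clipping range, and the function names, the tanh squashing, the coefficients `lam_expand` and `lam_contract`, and the bounds `eps_min` and `eps_max` are all hypothetical. This is not the paper's Algorithm 1.

```python
import math


def adaptive_clip_epsilon(entropy, max_entropy, reward_progress,
                          eps_base=0.2, lam_expand=0.5, lam_contract=0.3,
                          eps_min=0.05, eps_max=0.4):
    """Hypothetical bounded trust-region width driven by two signals.

    entropy / max_entropy: policy entropy and its maximum (exploration signal).
    reward_progress: non-negative signal that grows as returns converge
        (assumed form; the abstract does not define it).
    """
    # Normalize entropy to [0, 1]; guard against a degenerate maximum.
    h = min(max(entropy / max_entropy, 0.0), 1.0) if max_entropy > 0 else 0.0
    # Entropy-driven expansion: high uncertainty widens the region.
    expand = lam_expand * math.tanh(h)
    # Reward-guided contraction: convergence narrows the region.
    contract = lam_contract * math.tanh(max(reward_progress, 0.0))
    eps = eps_base * (1.0 + expand - contract)
    # Keep the result inside fixed bounds so the region stays bounded.
    return min(max(eps, eps_min), eps_max)


def clipped_surrogate(ratio, advantage, eps):
    """Standard PPO clipped objective for one sample, using the given eps."""
    clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    early = adaptive_clip_epsilon(entropy=1.0, max_entropy=1.0, reward_progress=0.0)
    late = adaptive_clip_epsilon(entropy=0.0, max_entropy=1.0, reward_progress=5.0)
    print(f"early-training eps: {early:.3f}, late-training eps: {late:.3f}")
```

Under these assumptions, a high-entropy policy with no reward progress gets a wider clipping range than the base value, and a low-entropy policy with strong reward progress gets a narrower one, while the clamp keeps ϵ bounded in both cases.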
