PPO-BR: Dual-Signal Entropy-Reward Adaptation for Trust Region Policy Optimization

May-26-2025–arXiv.org Artificial Intelligence

PPO-BR establishes a new paradigm in adaptive RL by fusing exploration and convergence signals into a single bounded trust region, a theoretically grounded innovation (Theorem 1) that outperforms 5 SOTA baselines with <2% overhead (Fig 3). This work bridges a critical gap in phase-aware learning and enables real-world deployment in safety-critical systems such as robotic surgery (Appendix E). The mechanism has two parts: (1) entropy-driven expansion of ϵ promotes exploration in high-uncertainty states, while (2) reward-guided contraction of ϵ enforces stability during convergence (Theorem 1). On 6 diverse benchmarks (MuJoCo/Atari/sparse-reward), PPO-BR achieves 29.1% faster convergence (p < 0.001, Wilcoxon test), 2.3× lower reward variance vs PPO (Fig 3), and <1.8% runtime overhead with just 5 lines of code change (Algorithm 1). PPO-BR's plug-and-play simplicity and theoretical guarantees (Lemma 2) make it ready to deploy in safety-critical systems, from surgical robotics to autonomous drones, where adaptive stability is non-negotiable. In contrast to recent methods such as Group Relative Policy Optimization (GRPO), PPO-BR offers a unified entropy-reward adaptive mechanism applicable to both language models and general reinforcement learning environments.
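The dual-signal idea in the abstract can be illustrated with a minimal sketch. It is not the paper's Algorithm 1: it assumes ϵ is PPO's clipping threshold, and the update form, the tanh squashing, and every name and default value (`adaptive_epsilon`, `lam_explore`, `lam_stable`, `eps_min`, `eps_max`) are hypothetical choices made only for illustration. Higher policy entropy widens ϵ, positive reward improvement narrows it, and the result is kept inside fixed bounds.

```python
import math


def adaptive_epsilon(entropy, max_entropy, reward_delta,
                     eps0=0.2, lam_explore=0.5, lam_stable=0.3,
                     eps_min=0.05, eps_max=0.4):
    """Return a bounded clip threshold from two signals (illustrative only).

    All parameter names and defaults are assumptions, not values from the paper.
    """
    # Normalize entropy to [0, 1] so the expansion term is scale-free.
    if max_entropy > 0:
        h = min(max(entropy / max_entropy, 0.0), 1.0)
    else:
        h = 0.0
    # Entropy-driven expansion: more uncertainty gives a wider trust region.
    expansion = lam_explore * math.tanh(h)
    # Reward-guided contraction: only positive improvement tightens the region.
    contraction = lam_stable * math.tanh(max(reward_delta, 0.0))
    eps = eps0 * (1.0 + expansion - contraction)
    # Keep the threshold inside fixed bounds.
    return min(max(eps, eps_min), eps_max)


def clipped_surrogate(ratio, advantage, eps):
    """Standard PPO clipped objective for a single sample."""
    clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    early = adaptive_epsilon(entropy=1.0, max_entropy=1.0, reward_delta=0.0)
    late = adaptive_epsilon(entropy=0.0, max_entropy=1.0, reward_delta=5.0)
    print(f"high-entropy phase eps: {early:.3f}")
    print(f"converging phase eps:  {late:.3f}")
```

In this sketch only the threshold passed to `clipped_surrogate` changes between phases; the rest of a PPO update would be left as it is, which is in the spirit of the small code change the abstract describes.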

machine learning, natural language, reinforcement learning

Country:
- Asia > Indonesia (0.14)

Genre:
- Research Report (1.00)

Industry:
- Leisure & Entertainment > Games (0.46)
- Health & Medicine
  - Surgery (0.68)
  - Health Care Technology (0.68)

Technology:
- Information Technology > Artificial Intelligence
  - Robots (1.00)
  - Representation & Reasoning (1.00)
  - Natural Language (1.00)
  - Machine Learning
    - Reinforcement Learning (1.00)
    - Neural Networks > Deep Learning (0.46)