Designing an efficient and equitable humanitarian supply chain dynamically via reinforcement learning
arXiv.org Artificial Intelligence
Specifically, PPO is a policy gradient method, often used for deep reinforcement learning when the policy network is very large. The predecessor to PPO, Trust Region Policy Optimization (TRPO), was published in 2015 by Schulman et al. It addressed the instability issue of another algorithm, the Deep Q-Network (DQN).
May-26-2025