Direct Policy Gradients: Direct Optimization of Policies in Discrete Action Spaces

May-27-2025, 12:32:47 GMT–Neural Information Processing Systems

Direct optimization (McAllester et al., 2010; Song et al., 2016) is an appealing framework for approximating gradients in models with discrete random variables: it replaces integration with the optimization of a random objective (Lorberbom et al., 2018). A* sampling (Maddison et al., 2014) is a framework for optimizing such random objectives over large spaces. We show how to combine these techniques to yield a reinforcement learning algorithm that approximates a policy gradient by finding trajectories that optimize a random objective. We call the resulting algorithms direct policy gradient (DirPG) algorithms. A main benefit of DirPG algorithms is that they allow domain knowledge to be inserted at training time in the form of upper bounds on return-to-go, as in heuristic search, while still directly computing a policy gradient.
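To make the idea of "optimizing a random objective" concrete, the following is a minimal sketch of a direct-optimization gradient estimate for the simplest possible case: a single-step problem with a softmax policy over a handful of actions. It is not the authors' implementation, and it uses brute-force argmax where the paper uses A* sampling over trajectories. The function names, the step size `epsilon`, and the sign convention (perturbing the objective by `+epsilon * reward`) are assumptions made for illustration. Each sample draws Gumbel noise, finds the action `a_opt` that maximizes the perturbed logits and the action `a_dir` that maximizes the perturbed logits plus `epsilon` times the reward, and accumulates the difference of their one-hot indicators, scaled by `1 / epsilon`.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def exact_gradient(logits, rewards):
    """Exact gradient of E_{a ~ softmax(logits)}[r(a)] w.r.t. the logits."""
    probs = softmax(logits)
    baseline = sum(p * r for p, r in zip(probs, rewards))
    return [p * (r - baseline) for p, r in zip(probs, rewards)]


def sample_gumbel(rng):
    """Standard Gumbel noise via inverse transform sampling."""
    u = max(rng.random(), 1e-12)  # guard against log(0)
    return -math.log(-math.log(u))


def direct_gradient_estimate(logits, rewards, epsilon=0.1,
                             num_samples=100_000, seed=0):
    """Finite-epsilon direct-optimization estimate (illustrative sketch).

    Assumption: the objective is perturbed by +epsilon * reward, so the
    estimate is (one_hot(a_dir) - one_hot(a_opt)) / epsilon, averaged
    over Gumbel draws. It is biased for any epsilon > 0.
    """
    rng = random.Random(seed)
    k = len(logits)
    grad = [0.0] * k
    for _ in range(num_samples):
        # Random objective: logits perturbed by independent Gumbel noise.
        perturbed = [logits[a] + sample_gumbel(rng) for a in range(k)]
        # Maximizer of the random objective (an exact sample from the policy).
        a_opt = max(range(k), key=lambda a: perturbed[a])
        # Maximizer once the objective is nudged toward higher reward.
        a_dir = max(range(k), key=lambda a: perturbed[a] + epsilon * rewards[a])
        grad[a_dir] += 1.0
        grad[a_opt] -= 1.0
    return [g / (epsilon * num_samples) for g in grad]


if __name__ == "__main__":
    logits = [0.5, -0.2, 0.1, 0.0]
    rewards = [1.0, 0.0, 0.5, 0.2]
    print("direct estimate:", direct_gradient_estimate(logits, rewards))
    print("exact gradient: ", exact_gradient(logits, rewards))
```

In expectation, this finite-`epsilon` estimate equals the finite difference `(softmax(logits + epsilon * rewards) - softmax(logits)) / epsilon`, so it approaches the exact gradient as `epsilon` shrinks, at the cost of higher variance, because `a_dir` and `a_opt` then differ on fewer samples.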

artificial intelligence, machine learning, reinforcement learning

Technology:
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.62)