Supplementary Material for "Variational Policy Gradient Method for Reinforcement Learning with General Utilities"

A Related Work


We provide a more extensive discussion of the context of this work. First, when closed-form expressions for the maximizer of a function are unavailable, solving optimization problems requires iterative schemes such as gradient ascent [31]. The convergence of such schemes to global optima is predicated on concavity and on the tractability of computing ascent directions. When the objective takes the form of an expectation of a function of a random variable, stochastic approximation is required [36, 24]. The policy gradient (PG) theorem gives a specific form for obtaining ascent directions with respect to a parameterized family of stationary policies via trajectories in a Markov decision process, when the objective is the expected cumulative return [44]; this is what gives rise to the REINFORCE algorithm.
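For concreteness, a standard statement of the PG theorem in the cumulative-return case reads as follows (the notation here is illustrative and not taken from the main text: $\pi_\theta$ denotes the parameterized policy, $\gamma$ the discount factor, $r$ the reward function, and $\tau = (s_0, a_0, s_1, a_1, \dots)$ a trajectory):
\[
\nabla_\theta J(\theta)
= \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \left( \sum_{t=0}^{\infty} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right) R(\tau) \right],
\qquad
R(\tau) = \sum_{t=0}^{\infty} \gamma^t\, r(s_t, a_t).
\]
REINFORCE is the stochastic-approximation instance of this identity: it replaces the expectation by a Monte Carlo estimate over sampled trajectories and takes ascent steps along the resulting gradient estimate.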
