Variational Policy Gradient Method for Reinforcement Learning with General Utilities
Neural Information Processing Systems
In recent years, reinforcement learning (RL) systems with general goals beyond a cumulative sum of rewards have gained traction, such as in constrained problems, exploration, and acting upon prior experiences. In this paper, we consider policy optimization in Markov Decision Problems, where the objective is a general concave utility function of the state-action occupancy measure, which subsumes several of the aforementioned examples as special cases.
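To make the objective concrete, the following is a minimal sketch of a discounted state-action occupancy measure and two utilities defined on it. Everything in it is an illustrative assumption rather than the paper's implementation: the two-state, two-action MDP, the helper names `occupancy_measure`, `linear_utility`, and `entropy_utility`, and the truncated-horizon summation. The linear utility recovers the usual cumulative discounted reward, while the entropy utility is one example of a concave, exploration-style objective.

```python
import math

def occupancy_measure(P, pi, xi, gamma, horizon=1000):
    """Approximate lam[s][a] = sum_t gamma^t * Pr(s_t = s, a_t = a)
    by truncating the infinite sum after `horizon` steps."""
    n_s, n_a = len(P), len(P[0])
    lam = [[0.0] * n_a for _ in range(n_s)]
    d = list(xi)          # state distribution at the current time step
    discount = 1.0        # gamma^t
    for _ in range(horizon):
        nxt = [0.0] * n_s
        for s in range(n_s):
            for a in range(n_a):
                p_sa = d[s] * pi[s][a]          # Pr(s_t = s, a_t = a)
                lam[s][a] += discount * p_sa
                for s2 in range(n_s):
                    nxt[s2] += p_sa * P[s][a][s2]
        d = nxt
        discount *= gamma
    return lam

def linear_utility(lam, r):
    """Cumulative discounted reward, written as the inner product <r, lam>."""
    return sum(lam[s][a] * r[s][a]
               for s in range(len(lam)) for a in range(len(lam[0])))

def entropy_utility(lam, gamma):
    """Entropy of the normalized occupancy (1 - gamma) * lam, a concave function of lam."""
    q = [(1.0 - gamma) * x for row in lam for x in row]
    return -sum(x * math.log(x) for x in q if x > 0.0)

# Hypothetical toy MDP: P[s][a][s'] are transition probabilities.
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.1, 0.9]]]
pi = [[0.5, 0.5], [0.5, 0.5]]   # uniform random policy pi[s][a]
xi = [1.0, 0.0]                 # start in state 0
gamma = 0.9
r = [[1.0, 0.0], [0.0, 2.0]]    # illustrative rewards r[s][a]

lam = occupancy_measure(P, pi, xi, gamma)
print("cumulative reward:", linear_utility(lam, r))
print("occupancy entropy:", entropy_utility(lam, gamma))
```

Because both utilities take only `lam` as input, swapping one for the other changes the objective without changing how the occupancy measure is computed, which is the sense in which a general utility subsumes the cumulative-reward case.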
Feb-8-2026, 00:04:46 GMT