Goto

Collaborating Authors

 Reinforcement Learning


AutomaticDataAugmentationforGeneralizationin ReinforcementLearning

Neural Information Processing Systems

Generalization to new environments remains a major challenge in deep reinforcement learning (RL). Current methods fail to generalize to unseen environments even when trained on similar settings [19, 51, 71, 11, 21, 12, 60].



284afdc2309f9667d2d4fb9290235b0c-Paper-Conference.pdf

Neural Information Processing Systems

Theseoutcome-conditioned imitationlearningmethodsare appealing because of their simplicity, strong performance, and close ties with supervisedlearning.




Supplementary Material for " Variational Policy Gradient Method for Reinforcement Learning with General Utilities " A Related Work

Neural Information Processing Systems

We provide a more extension discussion for the context of this work. Firstly, when closed-form expressions for the optimizer of a function are unavailable, solving optimization problems requires iterative schemes such as gradient ascent [31]. Their convergence to global extrema is predicated on concavity and the tractability of computing ascent directions. When the objective takes the form of an expected value of a function parameterized by a random variable, stochastic approximations are required [36, 24]. The PG Theorem mentioned above gives a specific form for obtaining ascent directions with respect to a parameterized family of stationary policies via trajectories in a Markov decision process, when the objective is the expected cumulative return [44], which gives rise to the REINFORCE algorithm.


Variational Policy Gradient Method for Reinforcement Learning with General Utilities

Neural Information Processing Systems

In recent years, reinforcement learning (RL) systems with general goals beyond a cumulative sum of rewards have gained traction, such as in constrained problems, exploration, and acting upon prior experiences. In this paper, we consider policy optimization in Markov Decision Problems, where the objective is a general concave utility function of the state-action occupancy measure, which subsumes several of the aforementioned examples as special cases.


30de9ece7cf3790c8c39ccff1a044209-Paper.pdf

Neural Information Processing Systems

One difficulty in using artificial agents for human-assistive applications lies in the challenge of accurately assisting with a person's goal(s). Existing methods tend to rely on inferring the human's goal, which is challenging when there are many potential goals or when the set of candidate goals is difficult to identify. We propose a new paradigm for assistance by instead increasing thehuman's ability tocontroltheir environment, and formalize this approach byaugmenting reinforcement learning withhuman empowerment.