Goto

Collaborating Authors

An Off-policy Policy Gradient Theorem Using Emphatic Weightings

Neural Information Processing Systems

Policy gradient methods are widely used for control in reinforcement learning, particularly for the continuous action setting. There have been a host of theoretically sound algorithms proposed for the on-policy setting, due to the existence of the policy gradient theorem which provides a simplified form for the gradient. In off-policy learning, however, where the behaviour policy is not necessarily attempting to learn and follow the optimal policy for the given task, the existence of such a theorem has been elusive. In this work, we solve this open problem by providing the first off-policy policy gradient theorem. The key to the derivation is the use of emphatic weightings. We develop a new actor-critic algorithm--called Actor Critic with Emphatic weightings (ACE)--that approximates the simplified gradients provided by the theorem. We demonstrate in a simple counterexample that previous off-policy policy gradient methods--particularly OffPAC and DPG--converge to the wrong solution whereas ACE finds the optimal solution.


An Off-policy Policy Gradient Theorem Using Emphatic Weightings

Neural Information Processing Systems

Policy gradient methods are widely used for control in reinforcement learning, particularly for the continuous action setting. There have been a host of theoretically sound algorithms proposed for the on-policy setting, due to the existence of the policy gradient theorem which provides a simplified form for the gradient. In off-policy learning, however, where the behaviour policy is not necessarily attempting to learn and follow the optimal policy for the given task, the existence of such a theorem has been elusive. In this work, we solve this open problem by providing the first off-policy policy gradient theorem. The key to the derivation is the use of emphatic weightings. We develop a new actor-critic algorithm—called Actor Critic with Emphatic weightings (ACE)—that approximates the simplified gradients provided by the theorem. We demonstrate in a simple counterexample that previous off-policy policy gradient methods—particularly OffPAC and DPG—converge to the wrong solution whereas ACE finds the optimal solution.


Convergent Actor-Critic Algorithms Under Off-Policy Training and Function Approximation

arXiv.org Artificial Intelligence

We present the first class of policy-gradient algorithms that work with both state-value and policy function-approximation, and are guaranteed to converge under off-policy training. Our solution targets problems in reinforcement learning where the action representation adds to the-curse-of-dimensionality; that is, with continuous or large action sets, thus making it infeasible to estimate state-action value functions (Q functions). Using state-value functions helps to lift the curse and as a result naturally turn our policy-gradient solution into classical Actor-Critic architecture whose Actor uses state-value function for the update. Our algorithms, Gradient Actor-Critic and Emphatic Actor-Critic, are derived based on the exact gradient of averaged state-value function objective and thus are guaranteed to converge to its optimal solution, while maintaining all the desirable properties of classical Actor-Critic methods with no additional hyper-parameters. To our knowledge, this is the first time that convergent off-policy learning methods have been extended to classical Actor-Critic methods with function approximation.


Generalized Off-Policy Actor-Critic

arXiv.org Artificial Intelligence

We propose a new objective, the counterfactual objective, unifying existing objectives for off-policy policy gradient algorithms in the continuing reinforcement learning (RL) setting. Compared to the commonly used excursion objective, which can be misleading about the performance of the target policy when deployed, our new objective better predicts such performance. We prove the Generalized Off-Policy Policy Gradient Theorem to compute the policy gradient of the counterfactual objective and use an emphatic approach to get an unbiased sample from this policy gradient, yielding the Generalized Off-Policy Actor-Critic (Geoff-PAC) algorithm. We demonstrate the merits of Geoff-PAC over existing algorithms in Mujoco robot simulation tasks, the first empirical success of emphatic algorithms in prevailing deep RL benchmarks.


Variance-Reduced Off-Policy Memory-Efficient Policy Search

arXiv.org Machine Learning

Off-policy policy optimization is a challenging problem in reinforcement learning (RL). The algorithms designed for this problem often suffer from high variance in their estimators, which results in poor sample efficiency, and have issues with convergence. A few variance-reduced on-policy policy gradient algorithms have been recently proposed that use methods from stochastic optimization to reduce the variance of the gradient estimate in the REINFORCE algorithm. However, these algorithms are not designed for the off-policy setting and are memory-inefficient, since they need to collect and store a large ``reference'' batch of samples from time to time. To achieve variance-reduced off-policy-stable policy optimization, we propose an algorithm family that is memory-efficient, stochastically variance-reduced, and capable of learning from off-policy samples. Empirical studies validate the effectiveness of the proposed approaches.