An Off-policy Policy Gradient Theorem Using Emphatic Weightings

Imani, Ehsan, Graves, Eric, White, Martha

Feb-14-2020, 04:56:10 GMT–Neural Information Processing Systems

Policy gradient methods are widely used for control in reinforcement learning, particularly for the continuous action setting. There have been a host of theoretically sound algorithms proposed for the on-policy setting, due to the existence of the policy gradient theorem which provides a simplified form for the gradient. In off-policy learning, however, where the behaviour policy is not necessarily attempting to learn and follow the optimal policy for the given task, the existence of such a theorem has been elusive. In this work, we solve this open problem by providing the first off-policy policy gradient theorem. The key to the derivation is the use of emphatic weightings.

emphatic weighting, gradient method, off-policy policy gradient theorem, (3 more...)

Neural Information Processing Systems

Feb-14-2020, 04:56:10 GMT

Conferences Web Page

Add feedback

Technology:
- Information Technology > Artificial Intelligence > Machine Learning (0.89)