An Off-policy Policy Gradient Theorem Using Emphatic Weightings

Imani, Ehsan, Graves, Eric, White, Martha

Neural Information Processing Systems

Policy gradient methods are widely used for control in reinforcement learning, particularly for the continuous action setting. There have been a host of theoretically sound algorithms proposed for the on-policy setting, due to the existence of the policy gradient theorem which provides a simplified form for the gradient. In off-policy learning, however, where the behaviour policy is not necessarily attempting to learn and follow the optimal policy for the given task, the existence of such a theorem has been elusive. In this work, we solve this open problem by providing the first off-policy policy gradient theorem. The key to the derivation is the use of emphatic weightings. We develop a new actor-critic algorithm, called Actor Critic with Emphatic weightings (ACE), that approximates the simplified gradients provided by the theorem. We demonstrate in a simple counterexample that previous off-policy policy gradient methods, particularly OffPAC and DPG, converge to the wrong solution, whereas ACE finds the optimal solution.
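For orientation, the simplified gradient the abstract refers to can be sketched as follows. This is a hedged paraphrase in the standard notation of the emphatic-weighting literature, not a quote of the paper's statement; the symbols $d_\mu$, $m$, and $\mathbf{P}_{\pi,\gamma}$ are assumptions of that convention, and the paper itself handles details such as interest functions and state-dependent discounting more carefully.

$$\nabla_\theta J_\mu(\theta) \;=\; \sum_{s} m(s) \sum_{a} \nabla_\theta \pi(a \mid s; \theta)\, q_\pi(s, a)$$

Here $J_\mu(\theta) = \sum_s d_\mu(s)\, v_\pi(s)$ is the excursion objective weighted by the behaviour policy's stationary distribution $d_\mu$, and the emphatic weighting $m$ satisfies $\mathbf{m}^\top = \mathbf{d}_\mu^\top (\mathbf{I} - \mathbf{P}_{\pi,\gamma})^{-1}$, where $\mathbf{P}_{\pi,\gamma}$ is the discounted state-to-state transition matrix under the target policy $\pi$. The point of the weighting is that states are emphasized according to how often the target policy would visit them starting from the behaviour distribution, rather than according to $d_\mu$ alone.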



Reviews: An Off-policy Policy Gradient Theorem Using Emphatic Weightings

Neural Information Processing Systems

The paper formulates and proves a policy gradient theorem in the off-policy setting. The derivation is based on an emphatic weighting of the states. Based on the introduced theorem, an actor-critic algorithm, termed ACE, is further proposed. The algorithm requires computing policy gradient updates that depend on the emphatic weightings. Computing low-variance estimates of these weightings is non-trivial, and the authors introduce a relaxed version that interpolates between the biased gradient estimate of off-policy actor-critic (Degris et al., 2012) and the unbiased (but high-variance) emphatically weighted estimator; the introduced estimator can be computed incrementally.
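A rough illustration of the incremental computation mentioned above is given below. This is a hedged sketch using emphatic-TD conventions rather than the paper's exact symbols: the follow-on trace $F_t$, interest $i(S_t)$, importance-sampling ratio $\rho_t = \pi(A_t \mid S_t;\theta)/\mu(A_t \mid S_t)$, constant discount $\gamma$, and trade-off parameter $\lambda_a \in [0,1]$ are all assumptions of that convention.

$$F_t = \rho_{t-1}\,\gamma\, F_{t-1} + i(S_t), \qquad M_t^{(\lambda_a)} = (1-\lambda_a)\, i(S_t) + \lambda_a\, F_t$$

With $\lambda_a = 0$ this recovers an OffPAC-style update weighted only by the behaviour distribution, while $\lambda_a = 1$ uses the full (unbiased but high-variance) emphatic weighting; the actor then follows a sample update proportional to $\rho_t\, M_t^{(\lambda_a)}\, \nabla_\theta \log \pi(A_t \mid S_t;\theta)$ scaled by a critic's estimate of the action value or advantage. Both $F_t$ and $M_t^{(\lambda_a)}$ require only the previous trace value, so the estimator can be maintained online without storing trajectories.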


A Multi-Agent Off-Policy Actor-Critic Algorithm for Distributed Reinforcement Learning

Suttle, Wesley, Yang, Zhuoran, Zhang, Kaiqing, Wang, Zhaoran, Basar, Tamer, Liu, Ji

arXiv.org Machine Learning

In this work we develop a new off-policy actor-critic algorithm that performs policy improvement with convergence guarantees in the multi-agent setting using function approximation. To achieve this, we extend the method of emphatic temporal differences (ETD(λ)) to the multi-agent setting with provable convergence under linear function approximation, and we also derive a novel off-policy policy gradient theorem for the multi-agent setting. Using these new results, we develop our two-timescale algorithm, which uses ETD(λ) to perform policy evaluation for the critic step at a faster timescale and policy gradient ascent using emphatic weightings for the actor step at a slower timescale. We also provide convergence guarantees for the actor step. Our work builds on recent advances in three main areas: multi-agent on-policy actor-critic methods, emphatic temporal difference learning for off-policy policy evaluation, and the use of emphatic weightings in off-policy policy gradient methods.
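For context, the single-agent ETD(λ) evaluation update that the critic step builds on can be sketched as follows. This is a hedged sketch of the standard formulation with linear function approximation $\hat v(s) = w^\top \phi(s)$, interest $i(S_t)$, and constant discount $\gamma$; the multi-agent extension in the paper adds coordination across agents that is not shown here.

$$\rho_t = \frac{\pi(A_t \mid S_t)}{\mu(A_t \mid S_t)}, \qquad F_t = \rho_{t-1}\,\gamma\, F_{t-1} + i(S_t), \qquad M_t = \lambda\, i(S_t) + (1-\lambda)\, F_t$$

$$e_t = \rho_t\left(\gamma\lambda\, e_{t-1} + M_t\, \phi(S_t)\right), \qquad w_{t+1} = w_t + \alpha\left(R_{t+1} + \gamma\, w_t^\top \phi(S_{t+1}) - w_t^\top \phi(S_t)\right) e_t$$

The emphasis $M_t$ rescales the eligibility trace so that the fixed point corresponds to the target policy's weighting of states, which is what makes the off-policy evaluation provably convergent under linear function approximation.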


An Off-policy Policy Gradient Theorem Using Emphatic Weightings

Imani, Ehsan, Graves, Eric, White, Martha

arXiv.org Machine Learning

Policy gradient methods are widely used for control in reinforcement learning, particularly for the continuous action setting. There have been a host of theoretically sound algorithms proposed for the on-policy setting, due to the existence of the policy gradient theorem which provides a simplified form for the gradient. In off-policy learning, however, where the behaviour policy is not necessarily attempting to learn and follow the optimal policy for the given task, the existence of such a theorem has been elusive. In this work, we solve this open problem by providing the first off-policy policy gradient theorem. The key to the derivation is the use of emphatic weightings. We develop a new actor-critic algorithm, called Actor Critic with Emphatic weightings (ACE), that approximates the simplified gradients provided by the theorem. We demonstrate in a simple counterexample that previous off-policy policy gradient methods, particularly OffPAC and DPG, converge to the wrong solution, whereas ACE finds the optimal solution.