Safe and Efficient Off-Policy Reinforcement Learning

Remi Munos, Tom Stepleton, Anna Harutyunyan, Marc Bellemare

Apr-22-2026, 01:53:54 GMT–Neural Information Processing Systems

In this work, we take a fresh look at some old and new algorithms for off-policy, return-based reinforcement learning. Expressing these in a common form, we derive a novel algorithm, Retrace(λ), with three desired properties: (1) it has low variance; (2) it safely uses samples collected from any behaviour policy, whatever its degree of "off-policyness"; and (3) it is efficient, as it makes the best use of samples collected from near on-policy behaviour policies. We analyze the contractive nature of the related operator under both off-policy policy evaluation and control settings and derive online sample-based algorithms. We believe this is the first return-based off-policy control algorithm converging almost surely to Q* without the GLIE assumption (Greedy in the Limit with Infinite Exploration). As a corollary, we prove the convergence of Watkins' Q(λ), which had been an open problem since 1989. We illustrate the benefits of Retrace(λ) on a standard suite of Atari 2600 games. One fundamental trade-off in reinforcement learning lies in the definition of the update target: should one estimate Monte Carlo returns or bootstrap from an existing Q-function?
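The abstract does not spell out the update itself, so the following is only a minimal sketch of a Retrace(λ)-style correction for a single trajectory. It assumes the truncated importance weights c_s = λ min(1, π(a_s|x_s) / μ(a_s|x_s)), where π is the target policy and μ the behaviour policy. The function name `retrace_correction` and all argument names are hypothetical and chosen for illustration; the inputs are plain per-step lists rather than any particular library's data structures.

```python
def retrace_correction(q, rewards, pi_probs, mu_probs, expected_next_q,
                       gamma=0.99, lam=1.0):
    """Sketch of a Retrace(lambda)-style correction to Q(x_0, a_0).

    All arguments are illustrative per-step lists for one trajectory:
      q[t]               current estimate Q(x_t, a_t)
      rewards[t]         reward r_t
      pi_probs[t]        target-policy probability pi(a_t | x_t)
      mu_probs[t]        behaviour-policy probability mu(a_t | x_t), > 0
      expected_next_q[t] expectation of Q(x_{t+1}, .) under pi
                         (0.0 after a terminal step)
    """
    total = 0.0
    trace = 1.0  # empty product: the first TD error is not reweighted
    for t in range(len(rewards)):
        if t > 0:
            # Truncated importance weight (assumed form): never above lam,
            # which keeps the product of weights from blowing up.
            trace *= lam * min(1.0, pi_probs[t] / mu_probs[t])
        td_error = rewards[t] + gamma * expected_next_q[t] - q[t]
        total += (gamma ** t) * trace * td_error
    return total


# On-policy data (pi == mu) with lam = 1: every TD error is used in full.
on_policy = retrace_correction(
    q=[0.0, 0.0, 0.0], rewards=[1.0, 1.0, 1.0],
    pi_probs=[0.5, 0.5, 0.5], mu_probs=[0.5, 0.5, 0.5],
    expected_next_q=[0.0, 0.0, 0.0], gamma=0.5)

# Off-policy data where pi gives zero probability to the second action:
# the trace is cut there, so later TD errors no longer contribute.
cut_trace = retrace_correction(
    q=[0.0, 0.0, 0.0], rewards=[1.0, 1.0, 1.0],
    pi_probs=[0.5, 0.0, 0.5], mu_probs=[0.5, 0.5, 0.5],
    expected_next_q=[0.0, 0.0, 0.0], gamma=0.5)
```

Under these assumptions the two calls illustrate properties (2) and (3): with near on-policy data the weights stay close to λ and the full return is used, while for strongly off-policy data the trace shrinks or is cut, which bounds the variance of the estimate.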

artificial intelligence, machine learning, reinforcement learning


Country:
- Europe (0.28)

Technology:
- Information Technology > Artificial Intelligence
  - Representation & Reasoning (1.00)
  - Machine Learning > Reinforcement Learning (1.00)
