DualDICE: Behavior-Agnostic Estimation of Discounted Stationary Distribution Corrections
Nachum, Ofir, Chow, Yinlam, Dai, Bo, Li, Lihong
In many real-world reinforcement learning applications, access to the environment is limited to a fixed dataset, instead of direct (online) interaction with the environment. When using this data for either evaluation or training of a new policy, accurate estimates of discounted stationary distribution ratios -- correction terms which quantify the likelihood that the new policy will experience a certain state-action pair normalized by the probability with which the state-action pair appears in the dataset -- can improve accuracy and performance. In this work, we propose an algorithm, DualDICE, for estimating these quantities. In contrast to previous approaches, our algorithm is agnostic to knowledge of the behavior policy (or policies) used to generate the dataset. Furthermore, it eschews any direct use of importance weights, thus avoiding potential optimization instabilities endemic of previous methods. In addition to providing theoretical guarantees, we present an empirical study of our algorithm applied to off-policy policy evaluation and find that our algorithm significantly improves accuracy compared to existing techniques.
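To make the role of these correction terms concrete, here is a minimal sketch (not taken from the paper; the function names and the self-normalized form are our own assumptions) of how estimated ratios w(s, a) ≈ d^pi(s, a) / d^D(s, a) would be plugged into off-policy policy evaluation as weights on the dataset's rewards:

```python
import numpy as np

def ope_estimate(states, actions, rewards, ratio_fn):
    """Off-policy estimate of the target policy's average per-step reward.

    states, actions, rewards: (s, a, r) transitions sampled from the fixed dataset D.
    ratio_fn: callable returning the estimated correction w(s, a) ~ d^pi(s, a) / d^D(s, a),
              e.g. a trained distribution-correction model (hypothetical interface).
    """
    w = np.array([ratio_fn(s, a) for s, a in zip(states, actions)])
    r = np.asarray(rewards, dtype=float)
    # Self-normalized weighted average: the correction weights reweight dataset rewards
    # so the empirical average reflects the target policy's state-action visitation.
    return float(np.sum(w * r) / np.sum(w))
```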
Reviews: DualDICE: Behavior-Agnostic Estimation of Discounted Stationary Distribution Corrections
NeurIPS 2019, December 8th through 14th, 2019, Vancouver Convention Center (Paper #1361)
Reviewer 1
Originality: I find this work to be original and the proposed algorithm to be novel. The authors clearly state what their contributions are and how their work differs from prior work.
Clarity/Quality: The paper is clearly written and easy to follow; the authors do a great job stating the problem they consider, explaining existing solutions and their drawbacks, and then thoroughly building up the intuition behind their approach. Each theoretical step makes sense and is intuitive. I also appreciate the authors taking the time to derive their method using a simple convex function and then demonstrating that it can be extended to a more general set of functions.
GradientDICE: Rethinking Generalized Offline Estimation of Stationary Values
Zhang, Shangtong, Liu, Bo, Whiteson, Shimon
We present GradientDICE for estimating the density ratio between the state distribution of the target policy and the sampling distribution in off-policy reinforcement learning. GradientDICE fixes several problems with GenDICE (Zhang et al., 2020), the current state-of-the-art for estimating such density ratios. Namely, the optimization problem in GenDICE is not a convex-concave saddle-point problem once nonlinearity in optimization variable parameterization is introduced, so primal-dual algorithms are not guaranteed to find the desired solution. However, such nonlinearity is essential to ensure the consistency of GenDICE even with a tabular representation. This is a fundamental contradiction, resulting from GenDICE's original formulation of the optimization problem. In GradientDICE, we optimize a different objective from GenDICE by using the Perron-Frobenius theorem and eliminating GenDICE's use of divergence. Consequently, nonlinearity in parameterization is not necessary for GradientDICE, which is provably convergent under linear function approximation.
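As background (our notation, not the paper's), these DICE-style estimators target a ratio tau(s, a) ≈ d^pi(s, a) / d^D(s, a), where the target distribution satisfies the stationary-distribution (backward Bellman) equation and the ratio is normalized under the data distribution. A sketch of those two standard conditions:

```latex
% Sketch of the standard constraints (our notation); GenDICE and GradientDICE differ
% in how violations of these conditions are turned into an optimization objective.
d^{\pi}(s',a') \;=\; (1-\gamma)\,\mu_0(s')\,\pi(a'\mid s')
  \;+\; \gamma \sum_{s,a} P(s'\mid s,a)\,\pi(a'\mid s')\, d^{\pi}(s,a),
\qquad
\mathbb{E}_{(s,a)\sim d^{D}}\!\left[\tau(s,a)\right] \;=\; 1 .
```

For gamma = 1 the first equation fixes d^pi only up to scale (a Perron-Frobenius eigenvector), which is why the normalization constraint and how it is enforced are central to the GenDICE/GradientDICE discussion above.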