Autoregressive Dynamics Models for Offline Policy Evaluation and Optimization
Zhang, Michael R., Paine, Tom Le, Nachum, Ofir, Paduraru, Cosmin, Tucker, George, Wang, Ziyu, Norouzi, Mohammad
Standard dynamics models for continuous control use feedforward computation to predict the conditional distribution of the next state and reward given the current state and action, typically as a multivariate Gaussian with diagonal covariance. This modeling choice assumes that different dimensions of the next state and reward are conditionally independent given the current state and action; it may be driven by the fact that fully observable, physics-based simulation environments entail deterministic transition dynamics. In this paper, we challenge this conditional independence assumption and propose a family of expressive autoregressive dynamics models that generate different dimensions of the next state and reward sequentially, each conditioned on the previously generated dimensions. We demonstrate that autoregressive dynamics models indeed outperform standard feedforward models in log-likelihood on held-out transitions. Furthermore, we compare different model-based and model-free off-policy evaluation (OPE) methods on RL Unplugged, a suite of offline MuJoCo datasets, and find that autoregressive dynamics models consistently outperform all baselines, achieving a new state of the art. Finally, we show that autoregressive dynamics models are useful for offline policy optimization, both by enriching the replay buffer through data augmentation and by improving performance via model-based planning.

Model-based Reinforcement Learning (RL) aims to learn an approximate model of the environment's dynamics from existing logged interactions to facilitate efficient policy evaluation and optimization. Early work on model-based RL uses simple tabular (Sutton, 1990; Moore and Atkeson, 1993; Peng and Williams, 1993) and locally linear (Atkeson et al., 1997) dynamics models, which often result in a large degree of model bias (Deisenroth and Rasmussen, 2011). Recent work adopts feedforward neural networks to model complex transition dynamics and improve generalization to unseen states and actions, achieving a high level of performance on standard RL benchmarks (Chua et al., 2018; Wang et al., 2019).
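To make the autoregressive factorization concrete, below is a minimal PyTorch-style sketch (not the authors' implementation; the class name `AutoregressiveDynamicsModel`, the per-dimension Gaussian heads, and the fixed dimension ordering are illustrative assumptions). Each dimension of the next state, and finally the reward, is sampled from a one-dimensional Gaussian conditioned on the current state, the action, and all previously generated dimensions, in contrast to a feedforward model that emits all dimensions at once under a diagonal-covariance Gaussian.

```python
import torch
import torch.nn as nn


class AutoregressiveDynamicsModel(nn.Module):
    """Illustrative autoregressive dynamics model: dimension i of the next
    state (and, last, the reward) is predicted conditioned on (s, a) and the
    i previously generated dimensions."""

    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.state_dim = state_dim
        # One small head per output dimension (state dims + 1 reward dim);
        # head i takes (s, a) concatenated with the i earlier outputs.
        self.heads = nn.ModuleList([
            nn.Sequential(
                nn.Linear(state_dim + action_dim + i, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, 2),  # mean and log-std of a 1-D Gaussian
            )
            for i in range(state_dim + 1)
        ])

    def sample(self, state: torch.Tensor, action: torch.Tensor):
        """Sample the next state and reward one dimension at a time."""
        context = torch.cat([state, action], dim=-1)
        outputs = []
        for head in self.heads:
            inp = torch.cat([context] + outputs, dim=-1)
            mean, log_std = head(inp).chunk(2, dim=-1)
            outputs.append(mean + log_std.exp() * torch.randn_like(mean))
        next_state = torch.cat(outputs[:-1], dim=-1)
        reward = outputs[-1]
        return next_state, reward
```

Under this factorization, training would maximize the sum of per-dimension log-probabilities of observed transitions, which is the held-out log-likelihood metric the abstract refers to; the exact architecture, output distribution, and dimension ordering used in the paper may differ from this sketch.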
arXiv.org Artificial Intelligence
Apr-28-2021