Reinforcement Learning
IKEA Furniture Assembly Environment for Long-Horizon Complex Manipulation Tasks
Lee, Youngwoon, Hu, Edward S., Yang, Zhengyu, Yin, Alex, Lim, Joseph J.
The IKEA Furniture Assembly Environment is one of the first benchmarks for testing and accelerating the automation of complex manipulation tasks. The environment is designed to advance reinforcement learning from simple toy tasks to complex tasks requiring both long-term planning and sophisticated low-level control. Our environment supports over 80 different furniture models, Sawyer and Baxter robot simulation, and domain randomization. The IKEA Furniture Assembly Environment is a testbed for methods aiming to solve complex manipulation tasks. The environment is publicly available at https://clvrai.com/furniture
Off-Policy Policy Gradient Algorithms by Constraining the State Distribution Shift
Islam, Riashat, Teru, Komal K., Sharma, Deepak
Off-policy deep reinforcement learning (RL) algorithms are incapable of learning solely from batch offline data without online interactions with the environment, due to the phenomenon known as extrapolation error. This is often because past data in the replay buffer may be quite different from the data distribution under the current policy. We argue that most off-policy learning methods fundamentally suffer from a state distribution shift caused by the mismatch between the state visitation distributions of the data collected by the behavior and target policies. This data distribution shift between current and past samples can significantly impact the performance of most modern off-policy policy optimization algorithms. In this work, we first conduct a systematic analysis of state distribution mismatch in off-policy learning, and then develop a novel off-policy policy optimization method to constrain the state distribution shift. To do this, we first estimate the state distribution from features of the state using a density estimator, and then develop a novel constrained off-policy gradient objective that minimizes the state distribution shift. Our experimental results on continuous control tasks show that minimizing this distribution mismatch can significantly improve performance in most popular practical off-policy policy gradient algorithms.
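The idea of penalizing state distribution shift can be illustrated with a minimal sketch. It assumes a one-dimensional state feature, a Gaussian kernel density estimator, and a KL-style penalty subtracted from the policy return; the function names, the bandwidth, and the weight `lam` are hypothetical choices for illustration and are not taken from the paper.

```python
import math
from typing import List


def kde_log_density(x: float, samples: List[float], bandwidth: float = 0.5) -> float:
    """Log of a Gaussian kernel density estimate at x (1-D, illustrative only)."""
    norm = len(samples) * bandwidth * math.sqrt(2.0 * math.pi)
    total = sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
    # The tiny constant guards against log(0) when x is far from every sample.
    return math.log(total / norm + 1e-300)


def shift_penalty(current: List[float], behavior: List[float], bandwidth: float = 0.5) -> float:
    """Sample estimate of KL(current || behavior) over the state feature."""
    diffs = [
        kde_log_density(s, current, bandwidth) - kde_log_density(s, behavior, bandwidth)
        for s in current
    ]
    return sum(diffs) / len(diffs)


def penalized_objective(policy_return: float, current: List[float],
                        behavior: List[float], lam: float = 0.1) -> float:
    """Hypothetical objective: return minus a weighted distribution-shift penalty."""
    return policy_return - lam * shift_penalty(current, behavior)


# Hypothetical state features visited by the current and behavior policies.
current_states = [0.0, 0.1, -0.1, 0.2, -0.2]
behavior_states = [3.0, 3.1, 2.9, 3.2, 2.8]
print(shift_penalty(current_states, current_states))   # no shift
print(shift_penalty(current_states, behavior_states))  # large shift
```

The penalty is zero when both sample sets coincide and grows as the current policy visits states that are rare in the replay data, which is the quantity a constrained objective of this kind would keep small.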
Working Memory Graphs
Loynd, Ricky, Fernandez, Roland, Celikyilmaz, Asli, Swaminathan, Adith, Hausknecht, Matthew
Transformers have increasingly outperformed gated RNNs in obtaining new state-of-the-art results on supervised tasks involving text sequences. Inspired by this trend, we study the question of how Transformer-based models can improve the performance of sequential decision-making agents. We present the Working Memory Graph (WMG), an agent that employs multi-head self-attention to reason over a dynamic set of vectors representing observed and recurrent state. We evaluate WMG in two partially observable environments, one that requires complex reasoning over past observations, and another that features factored observations. We find that WMG significantly outperforms gated RNNs on these tasks, supporting the hypothesis that WMG's inductive bias in favor of learning and leveraging factored representations can dramatically boost sample efficiency in environments featuring such structure. In the RNN-based approach of Sutskever et al. (2014), an encoder RNN maps an input sentence to a series of internal hidden state vectors. The encoder's final hidden state is copied into a decoder RNN, which then generates another sequence of hidden states that determine the selection of output tokens in the target language. This model can be trained to translate sentences, but translation quality deteriorates on long sentences where long-term dependencies become critical.
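The self-attention operation that WMG applies to its set of state vectors can be sketched in a few lines. This is a simplified illustration, assuming a single head and no learned query, key, or value projections (WMG itself uses multi-head attention with learned parameters); the function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors: List[Vector]) -> List[Vector]:
    """Single-head scaled dot-product self-attention over a set of vectors.

    Each vector serves as its own query, key, and value (identity projections,
    a simplifying assumption made for this sketch).
    """
    d = len(vectors[0])
    outputs = []
    for q in vectors:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in vectors]
        weights = softmax(scores)
        # Each output is a convex combination of all vectors in the set.
        outputs.append([sum(w * v[j] for w, v in zip(weights, vectors)) for j in range(d)])
    return outputs


# Hypothetical memory: two observation vectors and one recurrent-state vector.
memory = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
for row in self_attention(memory):
    print(row)
```

Because the operation treats its input as an unordered set, reordering the input vectors only reorders the outputs, which suits a memory whose contents change from step to step.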
Missingness as Stability: Understanding the Structure of Missingness in Longitudinal EHR data and its Impact on Reinforcement Learning in Healthcare
Fleming, Scott L., Jeyapragasan, Kuhan, Duan, Tony, Ding, Daisy, Gombar, Saurabh, Shah, Nigam, Brunskill, Emma
There is an emerging trend in the reinforcement learning for healthcare literature. In order to prepare longitudinal, irregularly sampled, clinical datasets for reinforcement learning algorithms, many researchers will resample the time series data to short, regular intervals and use last-observation-carried-forward (LOCF) imputation to fill in these gaps. Typically, they will not maintain any explicit information about which values were imputed. In this work, we (1) call attention to this practice and discuss its potential implications; (2) propose an alternative representation of the patient state that addresses some of these issues; and (3) demonstrate in a novel but representative clinical dataset that our alternative representation yields consistently better results for achieving optimal control, as measured by off-policy policy evaluation, compared to representations that do not incorporate missingness information.
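The difference between plain LOCF and an imputation that keeps missingness information can be sketched as follows. This is one plausible encoding, assuming each time step carries an imputed-value flag and a count of steps since the last real observation; it is not necessarily the representation proposed by the authors, and the function name and example series are hypothetical.

```python
from typing import List, Optional, Tuple

Row = Tuple[Optional[float], int, Optional[int]]


def locf_with_missingness(values: List[Optional[float]]) -> List[Row]:
    """LOCF imputation that also records which values were imputed.

    Returns one (value, was_imputed, steps_since_observed) tuple per time step.
    Steps before the first observation keep value None and steps_since None.
    """
    rows: List[Row] = []
    last: Optional[float] = None
    steps_since: Optional[int] = None
    for v in values:
        if v is not None:
            last, steps_since = v, 0
            rows.append((v, 0, 0))
        else:
            if steps_since is not None:
                steps_since += 1
            rows.append((last, 1, steps_since))
    return rows


# Hypothetical hourly heart-rate series; None marks hours without a measurement.
heart_rate = [None, 82.0, None, None, 90.0, None]
for row in locf_with_missingness(heart_rate):
    print(row)
```

With plain LOCF, the third and fourth hours would be indistinguishable from fresh measurements of 82.0; the extra two fields let a downstream policy tell a recent reading from a stale one.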
On Value Discrepancy of Imitation Learning
Imitation learning trains a policy from expert demonstrations. Imitation learning approaches have been designed from various principles, such as behavioral cloning via supervised learning, apprenticeship learning via inverse reinforcement learning, and GAIL via generative adversarial learning. In this paper, we propose a framework to analyze the theoretical properties of imitation learning approaches based on discrepancy propagation analysis. Under the infinite-horizon setting, the framework leads to a value discrepancy for behavioral cloning of order O((1-\gamma)^{-2}). We also show that the framework leads to a value discrepancy for GAIL of order O((1-\gamma)^{-1}). This implies that GAIL suffers fewer compounding errors than behavioral cloning, which is also verified empirically in this paper. To the best of our knowledge, we are the first to analyze GAIL's performance theoretically. The above results indicate that the proposed framework is a general tool for analyzing imitation learning approaches. We hope our theoretical results can provide insights for future improvements in imitation learning algorithms.
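For a concrete sense of the gap between the two orders, the horizon-dependent factors can be evaluated for typical discount factors. The sketch below only computes (1-\gamma)^{-2} and (1-\gamma)^{-1}; it ignores the constants and the per-step error terms that the actual bounds contain, and the function names are hypothetical.

```python
def bc_horizon_factor(gamma: float) -> float:
    """Horizon dependence of the behavioral cloning bound: (1 - gamma)^-2."""
    return (1.0 - gamma) ** -2


def gail_horizon_factor(gamma: float) -> float:
    """Horizon dependence of the GAIL bound: (1 - gamma)^-1."""
    return (1.0 - gamma) ** -1


for gamma in (0.9, 0.99):
    bc = bc_horizon_factor(gamma)
    gail = gail_horizon_factor(gamma)
    # The ratio equals the effective horizon 1 / (1 - gamma).
    print(f"gamma={gamma}: BC ~ {bc:.0f}, GAIL ~ {gail:.0f}, ratio ~ {bc / gail:.0f}")
```

At \gamma = 0.99 the effective horizon 1/(1-\gamma) is about 100, so the quadratic factor is roughly 100 times larger than the linear one, which is the compounding-error gap the abstract refers to.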
Improved Exploration through Latent Trajectory Optimization in Deep Deterministic Policy Gradient
Luck, Kevin Sebastian, Vecerik, Mel, Stepputtis, Simon, Amor, Heni Ben, Scholz, Jonathan
Model-free reinforcement learning algorithms such as Deep Deterministic Policy Gradient (DDPG) often require additional exploration strategies, especially if the actor is of deterministic nature. This work evaluates the use of model-based trajectory optimization methods used for exploration in Deep Deterministic Policy Gradient when trained on a latent image embedding. In addition, an extension of DDPG is derived using a value function as critic, making use of a learned deep dynamics model to compute the policy gradient. This approach leads to a symbiotic relationship between the deep reinforcement learning algorithm and the latent trajectory optimizer. The trajectory optimizer benefits from the critic learned by the RL algorithm and the latter from the enhanced exploration generated by the planner. The developed methods are evaluated on two continuous control tasks, one in simulation and one in the real world. In particular, a Baxter robot is trained to perform an insertion task, while only receiving sparse rewards and images as observations from the environment. Introduction. Reinforcement learning (RL) methods enabled the development of autonomous systems that can autonomously learn and master a task when provided with an objective function. RL has been successfully applied to a wide range of tasks including flying [24], [17], manipulation [26], [9], [12], [3], [1], locomotion [10], [13], and even autonomous driving [6], [7].
Generalized Maximum Causal Entropy for Inverse Reinforcement Learning
Mai, Tien, Chan, Kennard, Jaillet, Patrick
We consider the problem of learning from demonstrated trajectories with inverse reinforcement learning (IRL). Motivated by a limitation of the classical maximum entropy model (Ziebart, Bagnell, and Dey 2010) in capturing the structure of the network of states, we propose an IRL model based on a generalized version of the causal entropy maximization problem, which allows us to generate a class of maximum entropy IRL models. Our generalized model has the advantage of being able to recover, in addition to a reward function, another expert's function that would (partially) capture the impact of the connecting structure of the states on experts' decisions. Empirical evaluation on a real-world dataset and a grid-world dataset shows that our generalized model outperforms the classical ones in terms of recovering reward functions and demonstrated trajectories.
Six Degree-of-Freedom Hovering using LIDAR Altimetry via Reinforcement Meta-Learning
Gaudet, Brian, Linares, Richard, Furfaro, Roberto
We optimize a six degrees of freedom hovering policy using reinforcement meta-learning. The policy maps flash LIDAR measurements directly to on/off spacecraft body-frame thrust commands, allowing hovering at a fixed position and attitude in the asteroid body-fixed reference frame. Importantly, the policy does not require position and velocity estimates, and can operate in environments with unknown dynamics, and without an asteroid shape model or navigation aids. Indeed, during optimization the agent is confronted with a new randomly generated asteroid for each episode, ensuring that it does not learn an asteroid's shape, texture, or environmental dynamics. This allows the deployed policy to generalize well to novel asteroid characteristics, which we demonstrate in our experiments. The hovering controller has the potential to simplify mission planning by allowing asteroid body-fixed hovering immediately upon the spacecraft's arrival at an asteroid. This in turn simplifies shape model generation and allows resource mapping via remote sensing immediately upon arrival at the target asteroid.
Inverse Reinforcement Learning with Missing Data
Mai, Tien, Nguyen, Quoc Phong, Low, Kian Hsiang, Jaillet, Patrick
We consider the problem of recovering an expert's reward function with inverse reinforcement learning (IRL) when there are missing/incomplete state-action pairs or observations in the demonstrated trajectories. This issue of missing trajectory data or information occurs in many situations, e.g., GPS signals from vehicles moving on a road network are intermittent. In this paper, we propose a tractable approach to directly compute the log-likelihood of demonstrated trajectories with incomplete/missing data. Our algorithm is efficient in handling a large number of missing segments in the demonstrated trajectories, as it performs the training with incomplete data by solving a sequence of systems of linear equations, and the number of such systems to be solved does not depend on the number of missing segments. Empirical evaluation on a real-world dataset shows that our training algorithm outperforms other conventional techniques.