 Reinforcement Learning



Strategic Object Oriented Reinforcement Learning

arXiv.org Artificial Intelligence

Humans learn to play video games significantly faster than state-of-the-art reinforcement learning (RL) algorithms. Inspired by this, we introduce strategic object oriented reinforcement learning (SOORL) to learn simple dynamics models through automatic model selection and perform efficient planning with strategic exploration. We compare different exploration strategies in a model-based setting in which exact planning is impossible. Additionally, we test our approach on perhaps the hardest Atari game, Pitfall!, and achieve significantly improved exploration and performance over prior methods.


Learning a Prior over Intent via Meta-Inverse Reinforcement Learning

arXiv.org Machine Learning

A significant challenge for the practical application of reinforcement learning in the real world is the need to specify an oracle reward function that correctly defines a task. Inverse reinforcement learning (IRL) seeks to avoid this challenge by instead inferring a reward function from expert behavior. While appealing, it can be impractically expensive to collect datasets of demonstrations that cover the variation common in the real world (e.g. opening any type of door). Thus in practice, IRL must commonly be performed with only a limited set of demonstrations where it can be exceedingly difficult to unambiguously recover a reward function. In this work, we exploit the insight that demonstrations from other tasks can be used to constrain the set of possible reward functions by learning a "prior" that is specifically optimized for the ability to infer expressive reward functions from limited numbers of demonstrations. We demonstrate that our method can efficiently recover rewards from images for novel tasks and provide intuition as to how our approach is analogous to learning a prior.


Sequential Attacks on Agents for Long-Term Adversarial Goals

arXiv.org Machine Learning

Reinforcement learning (RL) has advanced greatly in the past few years with the use of effective deep neural networks (DNNs) as policy networks. With this effectiveness came a serious vulnerability of DNNs: small adversarial perturbations on the input can change the output of the network. Several works have pointed out that learned agents with a DNN policy network can be manipulated against achieving the original task through a sequence of small perturbations on the input states. In this paper, we further demonstrate that it is also possible to impose an arbitrary adversarial reward on the victim policy network through a sequence of attacks. Our method involves the latest adversarial attack technique, Adversarial Transformer Network (ATN), which learns to generate the attack and is easy to integrate into the policy network. As a result of our attack, the victim agent is misguided to optimise for the adversarial reward over time. Our results expose serious security threats for RL applications in safety-critical systems including drones, medical analysis, and self-driving cars.


Sample-Efficient Deep Reinforcement Learning via Episodic Backward Update

arXiv.org Machine Learning

We propose Episodic Backward Update, a new algorithm that boosts the performance of a deep reinforcement learning agent through fast reward propagation. In contrast to the conventional use of experience replay with uniform random sampling, our agent samples a whole episode and successively propagates the value of a state to its previous states. Our computationally efficient recursive algorithm allows sparse and delayed rewards to propagate efficiently through all transitions of a sampled episode. We evaluate our algorithm on the 2D MNIST Maze environment and 49 games of the Atari 2600 environment and show that our method improves sample efficiency at a competitive computational cost.
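The idea of sampling a whole episode and propagating values from the last transition back to the first can be illustrated with a small tabular sketch. This is not the authors' deep RL implementation: the function name `episodic_backward_update`, the dictionary-based Q-table, and the parameters `gamma` and `alpha` are assumptions chosen for illustration only.

```python
from collections import defaultdict


def episodic_backward_update(q, episode, gamma=0.99, alpha=1.0):
    """Tabular sketch: update Q-values over one episode in reverse time order.

    q:       mapping state -> mapping action -> value (hypothetical Q-table)
    episode: list of (state, action, reward, next_state, done) in time order
    """
    for state, action, reward, next_state, done in reversed(episode):
        next_values = q[next_state]
        if done or not next_values:
            # Terminal transition, or no estimate yet for the next state.
            target = reward
        else:
            # Bootstrap from the next state's value, which was already
            # refreshed earlier in this same backward sweep.
            target = reward + gamma * max(next_values.values())
        q[state][action] += alpha * (target - q[state][action])
    return q


# A three-step chain with a single sparse reward at the end.
q_table = defaultdict(lambda: defaultdict(float))
chain = [
    ("s0", "right", 0.0, "s1", False),
    ("s1", "right", 0.0, "s2", False),
    ("s2", "right", 1.0, "s3", True),
]
episodic_backward_update(q_table, chain, gamma=0.9, alpha=1.0)
```

In this toy chain a single backward sweep carries the terminal reward to the first state, whereas updating the same three transitions in forward order from a zero-initialized table would only change the value of the last one.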


Variational Inverse Control with Events: A General Framework for Data-Driven Reward Definition

arXiv.org Machine Learning

The design of a reward function often poses a major practical challenge to real-world applications of reinforcement learning. Approaches such as inverse reinforcement learning attempt to overcome this challenge, but require expert demonstrations, which can be difficult or expensive to obtain in practice. We propose variational inverse control with events (VICE), which generalizes inverse reinforcement learning methods to cases where full demonstrations are not needed, such as when only samples of desired goal states are available. Our method is grounded in an alternative perspective on control and reinforcement learning, where an agent's goal is to maximize the probability that one or more events will happen at some point in the future, rather than maximizing cumulative rewards. We demonstrate the effectiveness of our methods on continuous control tasks, with a focus on high-dimensional observations like images where rewards are hard or even impossible to specify.


r/MachineLearning - [D] Reinforcement learning measuring ground truth

@machinelearnbot

How hard it is to analyze the performance of the "ground truth" agent depends on the task: whether the environment is stochastic or deterministic, how obvious the reward function is (as simple as distance traveled, or more difficult like the score in Tetris), etc. A "ground truth" agent implies perfect performance, which is extremely difficult to obtain for all but the simplest environments. If you are mainly just interested in looking at how to model an agent's behaviors, then the performance of the "ground truth" agent may not matter. But if the performance of the "ground truth" agent does matter (it is part of an evolutionary process or something), then perhaps you could do something like record your own actions at the task (if doable), or compare the score to that of some baseline. Can you share more details about your project, like the environment you're using, what exactly you're trying to get out of it, what the project is for, etc.? This will help to get you a better answer.


Reinforcement Learning for Real Life Planning Problems

#artificialintelligence

To avoid the paper being thrown in the bin, we give this a large negative reward, say -1, and because the teacher is pleased with it being placed in the bin, this nets a large positive reward, 1. To avoid the outcome where it continually gets passed around the room, we set the reward for all other actions to be a small negative value, say -0.04. If we set this to a positive or null number, then the model may let the paper go round and round, as it would be better to gain small positives than risk getting close to the negative outcome. This number is also kept very small because the episode collects only a single terminal reward but may take many steps to end, and we need to ensure that, if the paper is placed in the bin, the positive outcome is not cancelled out. Please note, the rewards are always relative to one another and I have chosen arbitrary figures, but these can be changed if the results are not as desired.
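The arithmetic behind these figures can be checked with a short sketch. It assumes an undiscounted return (`gamma=1.0`), which the passage does not specify, and the helper name `episode_return` is hypothetical.

```python
def episode_return(steps_before_terminal, terminal_reward,
                   step_reward=-0.04, gamma=1.0):
    """Return of an episode: per-step rewards followed by one terminal reward.

    gamma=1.0 (no discounting) is an assumption made for this illustration.
    """
    total = 0.0
    for t in range(steps_before_terminal):
        total += (gamma ** t) * step_reward
    total += (gamma ** steps_before_terminal) * terminal_reward
    return total


# With a -0.04 step cost, a short route to the bin keeps most of the +1.
short_route = episode_return(5, 1.0)

# After 25 intermediate steps the accumulated cost cancels the +1 entirely.
long_route = episode_return(25, 1.0)

# With a positive step reward, passing the paper around for longer pays more.
looping = episode_return(100, 1.0, step_reward=0.04)
direct = episode_return(2, 1.0, step_reward=0.04)

print(round(short_route, 2), round(long_route, 2), looping > direct)
```

Under these assumptions a -0.04 step cost leaves about 0.8 of the terminal reward after five passes but nothing after twenty-five, which is why the step penalty has to stay small relative to the terminal rewards.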


Evaluating Reinforcement Learning Algorithms in Observational Health Settings

arXiv.org Machine Learning

Much attention has been devoted recently to the development of machine learning algorithms with the goal of improving treatment policies in healthcare. Reinforcement learning (RL) is a sub-field within machine learning that is concerned with learning how to make sequences of decisions so as to optimize long-term effects. Already, RL algorithms have been proposed to identify decision-making strategies for mechanical ventilation [Prasad et al., 2017], sepsis management [Raghu et al., 2017] and treatment of schizophrenia [Shortreed et al., 2011]. However, before implementing treatment policies learned by black-box algorithms in high-stakes clinical decision problems, special care must be taken in the evaluation of these policies. Specifically, we focus on the observational setting, that is, the setting in which our RL algorithm has proposed some treatment policy, and we want to evaluate it based on historical data. This setting is common in healthcare applications, where we do not wish to experiment with patients' lives without evidence that the proposed treatment strategy may be better than current practice. While formal statistical methods have been developed to assess the quality of new policies based on observational data alone [Thomas and Brunskill, 2016, Precup et al., 2000, Pearl, 2009, Imbens and Rubin, 2015], these methods rely on strong assumptions and are limited by statistical properties. We do not attempt to summarize this vast literature in this work; rather, we aim to provide a conceptual starting point for clinical and computational researchers to ask the right questions when designing and evaluating algorithms for new ways of treating patients. In the following, we describe how choices about how to summarize a history, variance of statistical estimators, and confounders in more ad-hoc measures can result in unreliable, even misleading estimates of the quality of a treatment policy.
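One standard estimator for evaluating a proposed policy from historical data is ordinary importance sampling, sketched below. This is a generic illustration, not the evaluation procedure of this paper; the function name, the trajectory format, and the two probability callables are all assumptions.

```python
def importance_sampling_estimate(trajectories, eval_prob, behavior_prob,
                                 gamma=1.0):
    """Ordinary importance sampling estimate of a policy's expected return.

    trajectories:  list of episodes, each a list of (state, action, reward)
    eval_prob:     callable (state, action) -> probability under the
                   policy being evaluated (hypothetical interface)
    behavior_prob: callable (state, action) -> probability under the
                   policy that generated the historical data
    """
    weighted_returns = []
    for trajectory in trajectories:
        weight = 1.0
        episode_return = 0.0
        for t, (state, action, reward) in enumerate(trajectory):
            # Reweight by how much more (or less) likely the evaluated
            # policy is to take the logged action than the behavior policy.
            weight *= eval_prob(state, action) / behavior_prob(state, action)
            episode_return += (gamma ** t) * reward
        weighted_returns.append(weight * episode_return)
    return sum(weighted_returns) / len(weighted_returns)


# Toy data: the historical policy picked each of two actions half the time;
# the evaluated policy always picks action 1.
logged = [
    [("s", 1, 1.0)],
    [("s", 0, 0.0)],
]
estimate = importance_sampling_estimate(
    logged,
    eval_prob=lambda s, a: 1.0 if a == 1 else 0.0,
    behavior_prob=lambda s, a: 0.5,
)
```

Because the weight is a product of ratios over every step, a few long trajectories can dominate the average, which is one way the variance of such estimators becomes a practical concern.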


Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models

arXiv.org Artificial Intelligence

Model-based reinforcement learning (RL) algorithms can attain excellent sample efficiency, but often lag behind the best model-free algorithms in terms of asymptotic performance, especially those with high-capacity parametric function approximators, such as deep networks. In this paper, we study how to bridge this gap, by employing uncertainty-aware dynamics models. We propose a new algorithm called probabilistic ensembles with trajectory sampling (PETS) that combines uncertainty-aware deep network dynamics models with sampling-based uncertainty propagation. Our comparison to state-of-the-art model-based and model-free deep RL algorithms shows that our approach matches the asymptotic performance of model-free algorithms on several challenging benchmark tasks, while requiring significantly fewer samples (e.g. 25 and 125 times fewer samples than Soft Actor Critic and Proximal Policy Optimization respectively on the half-cheetah task).