 Reinforcement Learning


[Perspective] Why does time seem to fly when we're having fun?

Science

Animals use the neurotransmitter dopamine to encode the relationship between their responses and reward. Reinforcement learning theory (1) successfully explains the role of phasic bursts of dopamine in terms of future reward maximization. Yet, dopamine clearly plays other roles in shaping behavior that have no obvious relationship to reinforcement learning, including modulating the rate at which our subjective sense of time grows in real time. On page 1273 of this issue, Soares et al. (2) closely examine the role of dopamine in mice performing a task in which they keep track of the time between two events and make decisions about this temporal duration. The results suggest the need to reassess the leading theory of dopamine function in timing--the dopamine clock hypothesis (3). They may also help explain empirical phenomena that challenge the reinforcement learning account of dopamine function.


Stochastic Primal-Dual Methods and Sample Complexity of Reinforcement Learning

arXiv.org Machine Learning

We study the online estimation of the optimal policy of a Markov decision process (MDP). We propose a class of Stochastic Primal-Dual (SPD) methods which exploit the inherent minimax duality of Bellman equations. The SPD methods update a few coordinates of the value and policy estimates as a new state transition is observed. These methods use little storage and have low computational complexity per iteration.
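
For readers unfamiliar with the "minimax duality of Bellman equations", the textbook discounted-reward case can be written as follows. This is a background sketch, not the paper's own formulation: the discount factor $\gamma$, initial distribution $\xi$, per-action reward vectors $r_a$, and transition matrices $P_a$ are notation assumed here for illustration.

```latex
% Bellman optimality equation for a discounted MDP (assumed setting)
v^*(s) = \max_{a} \Big[ r_a(s) + \gamma \sum_{s'} P_a(s, s')\, v^*(s') \Big]

% Equivalent linear program over the value vector v
\min_{v} \; (1-\gamma)\, \xi^{\top} v
\quad \text{s.t.} \quad (I - \gamma P_a)\, v \ge r_a \quad \text{for all } a

% Saddle-point (minimax) form, with one multiplier vector \mu_a \ge 0 per action
\min_{v} \; \max_{\mu \ge 0} \;
(1-\gamma)\, \xi^{\top} v + \sum_{a} \mu_a^{\top} \big( r_a - (I - \gamma P_a)\, v \big)
```

In this form the primal variable $v$ plays the role of the value estimate and the dual variable $\mu$ that of a state-action occupancy, which is how a primal-dual scheme can update a value estimate and a policy estimate together.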


Model-based Adversarial Imitation Learning

arXiv.org Machine Learning

Generative adversarial learning is a popular new approach to training generative models that has also proven successful on related problems. The general idea is to maintain an oracle $D$ that discriminates between the expert's data distribution and that of the generative model $G$. The generative model is trained to capture the expert's distribution by maximizing the probability of $D$ misclassifying the data it generates. Overall, the system is differentiable end-to-end and is trained using basic backpropagation. This type of learning was successfully applied to the problem of policy imitation in a model-free setup. However, a model-free approach does not allow the system to be differentiable, which requires the use of high-variance gradient estimations. In this paper we introduce the Model-based Adversarial Imitation Learning (MAIL) algorithm, a model-based approach to the problem of adversarial imitation learning. We show how to use a forward model to make the system fully differentiable, which enables us to train policies using the (stochastic) gradient of $D$. Moreover, our approach requires relatively few environment interactions and has fewer hyper-parameters to tune. We test our method on the MuJoCo physics simulator and report initial results that surpass the current state-of-the-art.
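
The discriminator/generator game described above is commonly written as the following minimax objective. This is the standard generative adversarial formulation, given as background rather than as the MAIL objective itself; the symbols $p_E$ (expert data distribution) and $p_G$ (distribution induced by the generative model), and the convention that $D(x)$ is the probability that $x$ came from the expert, are assumptions made here.

```latex
% Standard adversarial objective (assumed convention: D(x) = P[x is expert data])
\min_{G} \; \max_{D} \;
\mathbb{E}_{x \sim p_E}\big[ \log D(x) \big]
+ \mathbb{E}_{x \sim p_G}\big[ \log\big(1 - D(x)\big) \big]

% Equivalent generator-side view: make D misclassify generated samples as expert data
\max_{G} \; \mathbb{E}_{x \sim p_G}\big[ \log D(x) \big]
```

In the imitation setting, $x$ would typically be a state-action pair and $G$ the policy being trained.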


Elon Musk-backed OpenAI reveals Universe – a universal training ground for computers

#artificialintelligence

Hoping to teach AI agents the common sense they need to solve arbitrary tasks without specific training, OpenAI on Monday will introduce Universe, a collection of virtualized video games, browser interfaces, and applications that serve as a training ground for code-based decision making. Universe is open-source middleware that supports Gym, the organization's toolkit for developing and evaluating reinforcement learning (RL) algorithms. RL is used to train software to perform specific actions, such as playing a video game or making a 3D model walk, under a framework that prioritizes actions through a reward scheme. Universe aims to accelerate the education of AI agents by broadening the number of available training resources. Previously, according to OpenAI, the largest RL resource consisted of 55 Atari games, the Arcade Learning Environment.
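
To make "a framework that prioritizes actions through a reward scheme" concrete, here is a minimal sketch of tabular Q-learning on a toy corridor. Everything in it is hypothetical and for illustration only: `CorridorEnv`, `train`, and `greedy_policy` are made-up names, the `reset`/`step` methods merely imitate the general shape of a Gym-style interface, and none of it is Universe's or Gym's actual API.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: walk right along a corridor to reach the goal."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        # Every episode starts at the left end of the corridor.
        self.state = 0
        return self.state

    def step(self, action):
        # Action 0 moves left, action 1 moves right; walls clamp the position.
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 1.0 if done else 0.0  # reward only on reaching the goal
        return self.state, reward, done


def train(env, episodes=50, gamma=0.9, alpha=1.0, epsilon=0.2, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration.

    alpha=1.0 is reasonable only because this toy environment is deterministic.
    """
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(env.length)]
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Explore with probability epsilon, or when the two actions are tied.
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                action = rng.randrange(2)
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            next_state, reward, done = env.step(action)
            # Q-learning target: reward plus discounted best next-state value.
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


def greedy_policy(q):
    # Best action for each non-terminal state (the last state is the goal).
    return [0 if left > right else 1 for left, right in q[:-1]]


if __name__ == "__main__":
    q_table = train(CorridorEnv())
    print(greedy_policy(q_table))
```

The agent is never told that "right" is correct; the preference emerges because the goal reward propagates backwards through the value table, one state per episode in this deterministic setting.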



Project Malmo: Enabling AI technology that can collaborate with humans - Microsoft Research

#artificialintelligence

Project Malmo, a platform that uses the world of Minecraft as a testing ground for advanced artificial intelligence research and innovation, is available for novice to experienced programmers on GitHub via an open-source license. The system is primarily designed to help researchers develop sophisticated AI that can do things like learn, converse, make decisions and complete complex tasks. It supports research on a range of methods such as reinforcement learning, deep learning and symbolic AI, allowing researchers to compare and integrate different approaches to advance AI understanding, reasoning, learning and communications. Project Malmo is available at aka.ms/github-malmo


Transfer Learning Across Patient Variations with Hidden Parameter Markov Decision Processes

arXiv.org Machine Learning

Due to physiological variation, patients diagnosed with the same condition may exhibit divergent, but related, responses to the same treatments. Hidden Parameter Markov Decision Processes (HiP-MDPs) tackle this transfer-learning problem by embedding these tasks into a low-dimensional space. However, the original formulation of HiP-MDP had a critical flaw: the embedding uncertainty was modelled independently of the agent's state uncertainty, requiring an unnatural training procedure in which all tasks visited every part of the state space--possible for robots that can be moved to a particular location, impossible for human patients. We update the HiP-MDP framework and extend it to more robustly develop personalized medicine strategies for HIV treatment.


Playing Doom with SLAM-Augmented Deep Reinforcement Learning

arXiv.org Machine Learning

A number of recent approaches to policy learning in 2D game domains have succeeded in going directly from raw input images to actions. However, when employed in complex 3D environments, they typically suffer from challenges related to partial observability, combinatorial exploration spaces, path planning, and a scarcity of rewarding scenarios. Inspired by prior work in human cognition that indicates how humans employ a variety of semantic concepts and abstractions (object categories, localisation, etc.) to reason about the world, we build an agent-model that incorporates such abstractions into its policy-learning framework. We augment the raw image input to a Deep Q-Learning Network (DQN) by adding details of objects and structural elements encountered, along with the agent's localisation. The different components are automatically extracted and composed into a topological representation using on-the-fly object detection and 3D-scene reconstruction. We evaluate the efficacy of our approach in "Doom", a 3D first-person combat game that exhibits a number of the challenges discussed, and show that our augmented framework consistently learns better, more effective policies.


Contextual Decision Processes with Low Bellman Rank are PAC-Learnable

arXiv.org Machine Learning

We introduce a new model, called contextual decision processes, that unifies and generalizes most prior settings. Our first contribution is a complexity measure, the Bellman rank, that we show enables tractable learning of near-optimal behavior in these processes and is naturally small for many well-studied reinforcement learning settings. Our second contribution is a new reinforcement learning algorithm that engages in systematic exploration to learn contextual decision processes with low Bellman rank. Our algorithm provably learns near-optimal behavior with a number of samples that is polynomial in all relevant parameters but independent of the number of unique observations. The approach uses Bellman error minimization with optimistic exploration and provides new insights into efficient exploration for reinforcement learning with function approximation.


A Deep Hierarchical Approach to Lifelong Learning in Minecraft

arXiv.org Artificial Intelligence

We propose a lifelong learning system that can reuse and transfer knowledge from one task to another while efficiently retaining the previously learned knowledge base. Knowledge is transferred by learning reusable skills to solve tasks in Minecraft, a popular video game which is an unsolved and high-dimensional lifelong learning problem. These reusable skills, which we refer to as Deep Skill Networks, are then incorporated into our novel Hierarchical Deep Reinforcement Learning Network (H-DRLN) architecture using two techniques: (1) a deep skill array and (2) skill distillation, our novel variation of policy distillation (Rusu et al. 2015) for learning skills. Skill distillation enables the H-DRLN to efficiently retain knowledge and therefore scale in lifelong learning, by accumulating knowledge and encapsulating multiple reusable skills into a single distilled network. The H-DRLN exhibits superior performance and lower learning sample complexity compared to the regular Deep Q Network (Mnih et al. 2015) in sub-domains of Minecraft.