Hofmann, Katja

Depth and nonlinearity induce implicit exploration for RL

arXiv.org Artificial Intelligence

The question of how to explore, i.e., take actions with uncertain outcomes to learn about possible future rewards, is a key question in reinforcement learning (RL). Here, we show a surprising result: We show that Q-learning with nonlinear Q-function and no explicit exploration (i.e., a purely greedy policy) can learn several standard benchmark tasks, including mountain car, equally well as, or better than, the most commonly-used $\epsilon$-greedy exploration. We carefully examine this result and show that both the depth of the Q-network and the type of nonlinearity are important to induce such deterministic exploration.

Meta Reinforcement Learning with Latent Variable Gaussian Processes

arXiv.org Machine Learning

Data efficiency, i.e., learning from small data sets, is critical in many practical applications where data collection is time consuming or expensive, e.g., robotics, animal experiments or drug design. Meta learning is one way to increase the data efficiency of learning algorithms by generalizing learned concepts from a set of training tasks to unseen, but related, tasks. Often, this relationship between tasks is hard coded or relies in some other way on human expertise. In this paper, we propose to automatically learn the relationship between tasks using a latent variable model. Our approach finds a variational posterior over tasks and averages over all plausible (according to this posterior) tasks when making predictions. We apply this framework within a model-based reinforcement learning setting for learning dynamics models and controllers of many related tasks. We apply our framework in a model-based reinforcement learning setting, and show that our model effectively generalizes to novel tasks, and that it reduces the average interaction time needed to solve tasks by up to 60% compared to strong baselines.

A Deep Learning Approach for Joint Video Frame and Reward Prediction in Atari Games

arXiv.org Artificial Intelligence

Reinforcement learning is concerned with identifying reward-maximizing behaviour policies in environments that are initially unknown. State-of-the-art reinforcement learning approaches, such as deep Q-networks, are model-free and learn to act effectively across a wide range of environments such as Atari games, but require huge amounts of data. Model-based techniques are more data-efficient, but need to acquire explicit knowledge about the environment. In this paper, we take a step towards using model-based techniques in environments with a high-dimensional visual state space by demonstrating that it is possible to learn system dynamics and the reward structure jointly. Our contribution is to extend a recently developed deep neural network for video frame prediction in Atari games to enable reward prediction as well. To this end, we phrase a joint optimization problem for minimizing both video frame and reward reconstruction loss, and adapt network parameters accordingly. Empirical evaluations on five Atari games demonstrate accurate cumulative reward prediction of up to 200 frames. We consider these results as opening up important directions for model-based reinforcement learning in complex, initially unknown environments.

Experimental and causal view on information integration in autonomous agents

arXiv.org Artificial Intelligence

The amount of digitally available but heterogeneous information about the world is remarkable, and new technologies such as self-driving cars, smart homes, or the internet of things may further increase it. In this paper we examine certain aspects of the problem of how such heterogeneous information can be harnessed by autonomous agents. After discussing potentials and limitations of some existing approaches, we investigate how \emph{experiments} can help to obtain a better understanding of the problem. Specifically, we present a simple agent that integrates video data from a different agent, and implement and evaluate a version of it on the novel experimentation platform \emph{Malmo}. The focus of a second investigation is on how information about the hardware of different agents, the agents' sensory data, and \emph{causal} information can be utilized for knowledge transfer between agents and subsequently more data-efficient decision making. Finally, we discuss potential future steps w.r.t.\ theory and experimentation, and formulate open questions.

Collective Noise Contrastive Estimation for Policy Transfer Learning

AAAI Conferences

We address the problem of learning behaviour policies to optimise online metrics from heterogeneous usage data. While online metrics, e.g., click-through rate, can be optimised effectively using exploration data, such data is costly to collect in practice, as it temporarily degrades the user experience. Leveraging related data sources to improve online performance would be extremely valuable, but is not possible using current approaches. We formulate this task as a policy transfer learning problem, and propose a first solution, called collective noise contrastive estimation (collective NCE). NCE is an efficient solution to approximating the gradient of a log-softmax objective. Our approach jointly optimises embeddings of heterogeneous data to transfer knowledge from the source domain to the target domain. We demonstrate the effectiveness of our approach by learning an effective policy for an online radio station jointly from user-generated playlists, and usage data collected in an exploration bucket.