Goto

Collaborating Authors

 Reinforcement Learning


BBQ-Networks: Efficient Exploration in Deep Reinforcement Learning for Task-Oriented Dialogue Systems

arXiv.org Machine Learning

We present a new algorithm that significantly improves the efficiency of exploration for deep Q-learning agents in dialogue systems. Our agents explore via Thompson sampling, drawing Monte Carlo samples from a Bayes-by-Backprop neural network. Our algorithm learns much faster than common exploration strategies such as $\epsilon$-greedy, Boltzmann, bootstrapping, and intrinsic-reward-based ones. Additionally, we show that spiking the replay buffer with experiences from just a few successful episodes can make Q-learning feasible when it might otherwise fail.


Feature Control as Intrinsic Motivation for Hierarchical Reinforcement Learning

arXiv.org Artificial Intelligence

The problem of sparse rewards is one of the hardest challenges in contemporary reinforcement learning. Hierarchical reinforcement learning (HRL) tackles this problem by using a set of temporally-extended actions, or options, each of which has its own subgoal. These subgoals are normally handcrafted for specific tasks. Here, though, we introduce a generic class of subgoals with broad applicability in the visual domain. Underlying our approach (in common with work using "auxiliary tasks") is the hypothesis that the ability to control aspects of the environment is an inherently useful skill to have. We incorporate such subgoals in an end-to-end hierarchical reinforcement learning system and test two variants of our algorithm on a number of games from the Atari suite. We highlight the advantage of our approach in one of the hardest games -- Montezuma's revenge -- for which the ability to handle sparse rewards is key. Our agent learns several times faster than the current state-of-the-art HRL agent in this game, reaching a similar level of performance. UPDATE 22/11/17: We found that a standard A3C agent with a simple shaped reward, i.e. extrinsic reward + feature control intrinsic reward, has comparable performance to our agent in Montezuma Revenge. In light of the new experiments performed, the advantage of our HRL approach can be attributed more to its ability to learn useful features from intrinsic rewards rather than its ability to explore and reuse abstracted skills with hierarchical components. This has led us to a new conclusion about the result.


Deep Q-learning from Demonstrations

arXiv.org Artificial Intelligence

Deep reinforcement learning (RL) has achieved several high profile successes in difficult decision-making problems. However, these algorithms typically require a huge amount of data before they reach reasonable performance. In fact, their performance during learning can be extremely poor. This may be acceptable for a simulator, but it severely limits the applicability of deep RL to many real-world tasks, where the agent must learn in the real environment. In this paper we study a setting where the agent may access data from previous control of the system. We present an algorithm, Deep Q-learning from Demonstrations (DQfD), that leverages small sets of demonstration data to massively accelerate the learning process even from relatively small amounts of demonstration data and is able to automatically assess the necessary ratio of demonstration data while learning thanks to a prioritized replay mechanism. DQfD works by combining temporal difference updates with supervised classification of the demonstrator's actions. We show that DQfD has better initial performance than Prioritized Dueling Double Deep Q-Networks (PDD DQN) as it starts with better scores on the first million steps on 41 of 42 games and on average it takes PDD DQN 83 million steps to catch up to DQfD's performance. DQfD learns to out-perform the best demonstration given in 14 of 42 games. In addition, DQfD leverages human demonstrations to achieve state-of-the-art results for 11 games. Finally, we show that DQfD performs better than three related algorithms for incorporating demonstration data into DQN.


Bridging the Gap Between Value and Policy Based Reinforcement Learning

arXiv.org Artificial Intelligence

We establish a new connection between value and policy based reinforcement learning (RL) based on a relationship between softmax temporal value consistency and policy optimality under entropy regularization. Specifically, we show that softmax consistent action values correspond to optimal entropy regularized policy probabilities along any action sequence, regardless of provenance. From this observation, we develop a new RL algorithm, Path Consistency Learning (PCL), that minimizes a notion of soft consistency error along multi-step action sequences extracted from both on- and off-policy traces. We examine the behavior of PCL in different scenarios and show that PCL can be interpreted as generalizing both actor-critic and Q-learning algorithms. We subsequently deepen the relationship by showing how a single model can be used to represent both a policy and the corresponding softmax state values, eliminating the need for a separate critic. The experimental evaluation demonstrates that PCL significantly outperforms strong actor-critic and Q-learning baselines across several benchmarks.


Transferring Agent Behaviors from Videos via Motion GANs

arXiv.org Machine Learning

A major bottleneck for developing general reinforcement learning agents is determining rewards that will yield desirable behaviors under various circumstances. We introduce a general mechanism for automatically specifying meaningful behaviors from raw pixels. In particular, we train a generative adversarial network to produce short sub-goals represented through motion templates. We demonstrate that this approach generates visually meaningful behaviors in unknown environments with novel agents and describe how these motions can be used to train reinforcement learning agents.


Efficient Probabilistic Performance Bounds for Inverse Reinforcement Learning

arXiv.org Machine Learning

In the field of reinforcement learning there has been recent progress towards safety and high-confidence bounds on policy performance. However, to our knowledge, no practical methods exist for determining high-confidence policy performance bounds in the inverse reinforcement learning setting---where the true reward function is unknown and only samples of expert behavior are given. We propose a sampling method based on Bayesian inverse reinforcement learning that uses demonstrations to determine practical high-confidence upper bounds on the $\alpha$-worst-case difference in expected return between any evaluation policy and the optimal policy under the expert's unknown reward function. We evaluate our proposed bound on both a standard grid navigation task and a simulated driving task and achieve tighter and more accurate bounds than a feature count-based baseline. We also give examples of how our proposed bound can be utilized to perform risk-aware policy selection and risk-aware policy improvement. Because our proposed bound requires several orders of magnitude fewer demonstrations than existing high-confidence bounds, it is the first practical method that allows agents that learn from demonstration to express confidence in the quality of their learned policy.


Inverse Risk-Sensitive Reinforcement Learning

arXiv.org Machine Learning

HE modeling and learning of human decision-making behavior is increasingly becoming important as critical systems begin to rely more on automation and artificial intelligence. Yet, in this task we face a number of challenges, not least of which is the fact that humans are known to behave in ways that are not completely rational. There is mounting evidence to support the fact that humans often use reference points--e.g., the status quo or former experiences or recent expectaions about the future that are otherwise perceived to be related to the decision the human is making [1], [2]. It has also been observed that their decisions are impacted by their perception of the external world (exogenous factors) and their present state of mind (endogenous factors) as well as how the decision is framed or presented [3]. The success of descriptive behavioral models in capturing human behavior has long been touted by the psychology community and, more recently, by the economics community. In the engineering context, humans have largely been modeled, under rationality assumptions, from the so-called normative point of view where things are modeled as they ought to be, which is counter to a descriptive as is point of view. However, risk-sensitivity in the context of learning to control stochastic dynamical systems (see, e.g., [4], [5]) has been fairly extensively explored in engineering. Many of these approaches are targeted at mitigating risks due to uncertainties in controlling a system such as a plant or robot where risk-aversion is captured by leveraging techniques such as exponential utility functions or minimizing mean-variance criteria. Complex risk-sensitive behavior arising from human interaction with automation is only recently coming into focus.


Classification with Costly Features using Deep Reinforcement Learning

arXiv.org Machine Learning

We study a classification problem where each feature can be acquired for a cost and the goal is to optimize the trade-off between classification precision and the total feature cost. We frame the problem as a sequential decision-making problem, where we classify one sample in each episode. At each step, an agent can use values of acquired features to decide whether to purchase another one or whether to classify the sample. We use vanilla Double Deep Q-learning, a standard reinforcement learning technique, to find a classification policy. We show that this generic approach outperforms Adapt-Gbrt, currently the best-performing algorithm developed specifically for classification with costly features.


Teaching a Machine to Read Maps with Deep Reinforcement Learning

arXiv.org Machine Learning

The ability to use a 2D map to navigate a complex 3D environment is quite remarkable, and even difficult for many humans. Localization and navigation is also an important problem in domains such as robotics, and has recently become a focus of the deep reinforcement learning community. In this paper we teach a reinforcement learning agent to read a map in order to find the shortest way out of a random maze it has never seen before. Our system combines several state-of-the-art methods such as A3C and incorporates novel elements such as a recurrent localization cell. Our agent learns to localize itself based on 3D first person images and an approximate orientation angle. The agent generalizes well to bigger mazes, showing that it learned useful localization and navigation capabilities.


Under the Hood with Reinforcement Learning – Understanding Basic RL Models

@machinelearnbot

Summary: Reinforcement Learning (RL) is likely to be the next big push in artificial intelligence. But the concept of modeling in RL is very different from our statistical techniques and deep learning. In this two part series we'll take a look at the basics of RL models, how they're built and used. In the next part, we'll address some of the complexities that make development a challenge. Now that we have pretty much conquered speech, text, and image processing with deep neural nets, it's time to turn our attention to what comes next. It's likely that the next most important area of development for AI will be reinforcement learning (RL).