Goto

Collaborating Authors

 Reinforcement Learning


Go-Explore: a New Approach for Hard-Exploration Problems

arXiv.org Machine Learning

A grand challenge in reinforcement learning is intelligent exploration, especially when rewards are sparse or deceptive. Two Atari games serve as benchmarks for such hard-exploration domains: Montezuma's Revenge and Pitfall. On both games, current RL algorithms perform poorly, even those with intrinsic motivation, which is the dominant method to improve performance on hard-exploration domains. To address this shortfall, we introduce a new algorithm called Go-Explore. It exploits the following principles: (1) remember previously visited states, (2) first return to a promising state (without exploration), then explore from it, and (3) solve simulated environments through any available means (including by introducing determinism), then robustify via imitation learning. The combined effect of these principles is a dramatic performance improvement on hard-exploration problems. On Montezuma's Revenge, Go-Explore scores a mean of over 43k points, almost 4 times the previous state of the art. Go-Explore can also harness human-provided domain knowledge and, when augmented with it, scores a mean of over 650k points on Montezuma's Revenge. Its max performance of nearly 18 million surpasses the human world record, meeting even the strictest definition of "superhuman" performance. On Pitfall, Go-Explore with domain knowledge is the first algorithm to score above zero. Its mean score of almost 60k points exceeds expert human performance. Because Go-Explore produces high-performing demonstrations automatically and cheaply, it also outperforms imitation learning work where humans provide solution demonstrations. Go-Explore opens up many new research directions into improving it and weaving its insights into current RL algorithms. It may also enable progress on previously unsolvable hard-exploration problems in many domains, especially those that harness a simulator during training (e.g. robotics).


Probability Functional Descent: A Unifying Perspective on GANs, Variational Inference, and Reinforcement Learning

arXiv.org Machine Learning

The goal of this paper is to provide a unifying view of a wide range of problems of interest in machine learning by framing them as the minimization of functionals defined on the space of probability measures. In particular, we show that generative adversarial networks, variational inference, and actor-critic methods in reinforcement learning can all be seen through the lens of our framework. We then discuss a generic optimization algorithm for our formulation, called probability functional descent (PFD), and show how this algorithm recovers existing methods developed independently in the settings mentioned earlier.


Transfer in Deep Reinforcement Learning Using Successor Features and Generalised Policy Improvement

arXiv.org Artificial Intelligence

The ability to transfer skills across tasks has the potential to scale up reinforcement learning (RL) agents to environments currently out of reach. Recently, a framework based on two ideas, successor features (SFs) and generalised policy improvement (GPI), has been introduced as a principled way of transferring skills. In this paper we extend the SFs & GPI framework in two ways. One of the basic assumptions underlying the original formulation of SFs & GPI is that rewards for all tasks of interest can be computed as linear combinations of a fixed set of features. We relax this constraint and show that the theoretical guarantees supporting the framework can be extended to any set of tasks that only differ in the reward function. Our second contribution is to show that one can use the reward functions themselves as features for future tasks, without any loss of expressiveness, thus removing the need to specify a set of features beforehand. This makes it possible to combine SFs & GPI with deep learning in a more stable way. We empirically verify this claim on a complex 3D environment where observations are images from a first-person perspective. We show that the transfer promoted by SFs & GPI leads to very good policies on unseen tasks almost instantaneously. We also describe how to learn policies specialised to the new tasks in a way that allows them to be added to the agent's set of skills, and thus be reused in the future.


Benchmarking Classic and Learned Navigation in Complex 3D Environments

arXiv.org Artificial Intelligence

Navigation research is attracting renewed interest with the advent of learning-based methods. However, this new line of work is largely disconnected from well-established classic navigation approaches. In this paper, we take a step towards coordinating these two directions of research. We set up classic and learning-based navigation systems in common simulated environments and thoroughly evaluate them in indoor spaces of varying complexity, with access to different sensory modalities. Additionally, we measure human performance in the same environments. We find that a classic pipeline, when properly tuned, can perform very well in complex cluttered environments. On the other hand, learned systems can operate more robustly with a limited sensor suite. Both approaches are still far from human-level performance.


Berkeley startup to train robots like puppets – RtoZ.Org – Latest Technology News

#artificialintelligence

The Researchers from the University of California, Berkeley, have launched a start-up, Embodied Intelligence, Inc., to use the latest techniques of deep reinforcement learning and artificial intelligence to make industrial robots easily teachable. Robots today must be programmed by writing computer code, but imagine donning a VR headset and virtually guiding a robot through a task, like you would move the arms of a puppet, and then letting the robot take it from there. That's the vision of Pieter Abbeel, a professor of electrical engineering and computer science at the University of California, Berkeley, and his students, Peter Chen, Rocky Duan and Tianhao Zhang, who have launched a startup, Embodied Intelligence Inc., to use the latest techniques of deep reinforcement learning and artificial intelligence to make industrial robots easily teachable. "Right now, if you want to set up a robot, you program that robot to do what you want it to do, which takes a lot of time and a lot of expertise," said Abbeel, who is currently on leave to turn his vision into reality. "With our advances in machine learning, we can write a piece of software once -- machine learning code that enables the robot to learn -- and then when the robot needs to be equipped with a new skill, we simply provide new data."


Quantum generative adversarial learning in a superconducting quantum circuit

#artificialintelligence

Machine learning (1), or more broadly artificial intelligence (2), represents an important area with general practical applications where near-term quantum devices may offer a substantial speedup over classical ones. With this vision, an intriguing interdisciplinary field of quantum machine learning/artificial intelligence has emerged and attracted tremendous attention in recent years (3, 4). A number of quantum algorithms that promise exponential speedups have been theoretically proposed (3–6), and some were demonstrated in proof-of-principle experiments (7, 8). Yet, in most of these previous scenarios, the input datasets considered are typically classical. As a result, certain costly processes or techniques, such as quantum random access memories (9), are required to first map the classical data to quantum wave functions so as to be processed by quantum devices, rendering the potential speedups nullified (10).


Safe, Efficient, and Comfortable Velocity Control based on Reinforcement Learning for Autonomous Driving

arXiv.org Machine Learning

A model used for velocity control during car following was proposed based on deep reinforcement learning (RL). To fulfil the multi-objectives of car following, a reward function reflecting driving safety, efficiency, and comfort was constructed. With the reward function, the RL agent learns to control vehicle speed in a fashion that maximizes cumulative rewards, through trials and errors in the simulation environment. A total of 1,341 car-following events extracted from the Next Generation Simulation (NGSIM) dataset were used to train the model. Car-following behavior produced by the model were compared with that observed in the empirical NGSIM data, to demonstrate the model's ability to follow a lead vehicle safely, efficiently, and comfortably. Results show that the model demonstrates the capability of safe, efficient, and comfortable velocity control in that it 1) has small percentages (8\%) of dangerous minimum time to collision values (\textless\ 5s) than human drivers in the NGSIM data (35\%); 2) can maintain efficient and safe headways in the range of 1s to 2s; and 3) can follow the lead vehicle comfortably with smooth acceleration. The results indicate that reinforcement learning methods could contribute to the development of autonomous driving systems.


Multi Agent Reinforcement Learning with Multi-Step Generative Models

arXiv.org Machine Learning

The dynamics between agents and the environment are an important component of multi-agent Reinforcement Learning (RL), and learning them provides a basis for decision making. However, a major challenge in optimizing a learned dynamics model is the accumulation of error when predicting multiple steps into the future. Recent advances in variational inference provide model based solutions that predict complete trajectory segments, and optimize over a latent representation of trajectories. For single-agent scenarios, several recent studies have explored this idea, and showed its benefits over conventional methods. In this work, we extend this approach to the multi-agent case, and effectively optimize over a latent space that encodes multi-agent strategies. We discuss the challenges in optimizing over a latent variable model for multiple agents, both in the optimization algorithm and in the model representation, and propose a method for both cooperative and competitive settings based on risk-sensitive optimization. We evaluate our method on tasks in the multi-agent particle environment and on a simulated RoboCup domain.


Making Deep Q-learning methods robust to time discretization

arXiv.org Machine Learning

Despite remarkable successes, Deep Reinforcement Learning (DRL) is not robust to hyperparameterization, implementation details, or small environment changes (Henderson et al. 2017, Zhang et al. 2018). Overcoming such sensitivity is key to making DRL applicable to real world problems. In this paper, we identify sensitivity to time discretization in near continuous-time environments as a critical factor; this covers, e.g., changing the number of frames per second, or the action frequency of the controller. Empirically, we find that Q-learning-based approaches such as Deep Q- learning (Mnih et al., 2015) and Deep Deterministic Policy Gradient (Lillicrap et al., 2015) collapse with small time steps. Formally, we prove that Q-learning does not exist in continuous time. We detail a principled way to build an off-policy RL algorithm that yields similar performances over a wide range of time discretizations, and confirm this robustness empirically.


Imitation Learning from Imperfect Demonstration

arXiv.org Machine Learning

Imitation learning (IL) has become of great interest because obtaining demonstrations is usually easier than designing reward. Reward is a signal to instruct agents to complete the desired tasks. However, ill-designed reward functions usually lead to unexpected behaviors [Amodei et al., 2016; Dewey, 2014; Everitt and Hutter, 2016]. There are two main approaches that can be used to solve IL: behavioral cloning (BC) [Schaal, 1999], which adopts supervised learning approaches to learn an action predictor that is trained directly from demonstration data; and apprenticeship learning (AL), which attempts to find a policy that is better than the expert policy for a class of cost functions [Abbeel and Ng, 2004]. Even though BC can be trained with supervised learning approaches directly, it has been shown that BC cannot imitate the expert policy without a large amount of demonstration data for not considering the transition of environments [Ross et al., 2011].