Reinforcement Learning
Deep reinforcement learning from human preferences
Christiano, Paul, Leike, Jan, Brown, Tom B., Martic, Miljan, Legg, Shane, Amodei, Dario
For sophisticated reinforcement learning (RL) systems to interact usefully with real-world environments, we need to communicate complex goals to these systems. In this work, we explore goals defined in terms of (non-expert) human preferences between pairs of trajectory segments. We show that this approach can effectively solve complex RL tasks without access to the reward function, including Atari games and simulated robot locomotion, while providing feedback on less than one percent of our agent's interactions with the environment. This reduces the cost of human oversight far enough that it can be practically applied to state-of-the-art RL systems. To demonstrate the flexibility of our approach, we show that we can successfully train complex novel behaviors with about an hour of human time. These behaviors and environments are considerably more complex than any that have been previously learned from human feedback.
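As a rough illustration of learning from preferences between pairs of trajectory segments, the sketch below scores a candidate reward function against human comparisons. It assumes a Bradley-Terry style model, in which the probability of preferring one segment is a logistic function of the difference in summed predicted reward. The abstract does not state this model, so treat it as an assumption. The names `reward_fn`, `segment_return`, `preference_probability`, and `preference_loss` are hypothetical.

```python
import math


def _sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def segment_return(reward_fn, segment):
    """Sum the predicted reward over a segment given as (observation, action) pairs."""
    return sum(reward_fn(obs, act) for obs, act in segment)


def preference_probability(reward_fn, seg_a, seg_b):
    """Modelled probability that a human prefers seg_a over seg_b (assumed Bradley-Terry form)."""
    diff = segment_return(reward_fn, seg_a) - segment_return(reward_fn, seg_b)
    return _sigmoid(diff)


def preference_loss(reward_fn, comparisons):
    """Mean cross-entropy between modelled and human preferences.

    Each comparison is (seg_a, seg_b, label), where label is 1.0 if the human
    preferred seg_a, 0.0 if they preferred seg_b, and 0.5 for no preference.
    """
    total = 0.0
    for seg_a, seg_b, label in comparisons:
        p = preference_probability(reward_fn, seg_a, seg_b)
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # keep the logarithms finite
        total -= label * math.log(p) + (1.0 - label) * math.log(1.0 - p)
    return total / len(comparisons)
```

In a full system the reward function would be a trainable model fitted by minimizing this loss, and the agent would then be trained with ordinary RL on the learned reward. Both of those steps are omitted here.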
Deep Decentralized Multi-task Multi-Agent Reinforcement Learning under Partial Observability
Omidshafiei, Shayegan, Pazis, Jason, Amato, Christopher, How, Jonathan P., Vian, John
Many real-world tasks involve multiple agents with partial observability and limited communication. Learning is challenging in these settings due to local viewpoints of agents, which perceive the world as non-stationary due to concurrently-exploring teammates. Approaches that learn specialized policies for individual tasks face problems when applied to the real world: not only do agents have to learn and store distinct policies for each task, but in practice identities of tasks are often non-observable, making these approaches inapplicable. This paper formalizes and addresses the problem of multi-task multi-agent reinforcement learning under partial observability. We introduce a decentralized single-task learning approach that is robust to concurrent interactions of teammates, and present an approach for distilling single-task policies into a unified policy that performs well across multiple related tasks, without explicit provision of task identity.
DeepMind's AI is teaching itself parkour, and the results are adorable
Keeping up with the latest AI research can be an odd experience. On the one hand, you're aware that you're looking at cutting-edge experimentation, with new papers outlining the ideas and methods that will probably (eventually) snowball into the biggest technological revolution of all time. On the other hand, sometimes what you're looking at is just unavoidably weird and funny. Case in point is a new paper from Google's AI subsidiary DeepMind titled "Emergence of Locomotion Behaviours in Rich Environments." The research explores how reinforcement learning (or RL) can be used to teach a computer to navigate unfamiliar and complex environments.
Google's DeepMind uses reinforcement learning to master parkour
Google has taught its DeepMind AI to navigate a parkour course by using reinforcement learning. Reinforcement learning is the practice of rewarding desirable behaviour. The faster the AI could navigate the virtual parkour course, the greater the reward. Further incentives and penalties were added for various other metrics. "We train several simulated bodies on a diverse set of challenging terrains and obstacles, using a simple reward function based on forward progress," explains Nicolas Heess, a researcher on the project.
Learning Visual Servoing with Deep Features and Fitted Q-Iteration
Lee, Alex X., Levine, Sergey, Abbeel, Pieter
Visual servoing involves choosing actions that move a robot in response to observations from a camera, in order to reach a goal configuration in the world. Standard visual servoing approaches typically rely on manually designed features and analytical dynamics models, which limits their generalization capability and often requires extensive application-specific feature and model engineering. In this work, we study how learned visual features, learned predictive dynamics models, and reinforcement learning can be combined to learn visual servoing mechanisms. We focus on target following, with the goal of designing algorithms that can learn a visual servo using small amounts of data about the target in question, to enable quick adaptation to new targets. Our approach is based on servoing the camera in the space of learned visual features, rather than image pixels or manually designed keypoints. We demonstrate that standard deep features, in our case taken from a model trained for object classification, can be used together with a bilinear predictive model to learn an effective visual servo that is robust to visual variation, changes in viewing angle and appearance, and occlusions. A key component of our approach is to use a sample-efficient fitted Q-iteration algorithm to learn which features are best suited for the task at hand. We show that we can learn an effective visual servo on a complex synthetic car following benchmark using just 20 training trajectory samples for reinforcement learning. We demonstrate substantial improvement over a conventional approach based on image pixels or hand-designed keypoints, and we show an improvement in sample-efficiency of more than two orders of magnitude over standard model-free deep reinforcement learning algorithms.
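For readers unfamiliar with fitted Q-iteration, the sketch below shows the generic batch form of the idea: repeatedly regress Q-values onto bootstrapped targets computed from a fixed set of transitions. This is not the paper's variant, which learns weights over visual features. It assumes discrete states and actions and uses a per-pair average as a stand-in regressor. The function name and the transition layout are hypothetical.

```python
from collections import defaultdict


def fitted_q_iteration(transitions, actions, gamma=0.9, n_iters=50):
    """Generic batch fitted Q-iteration with a tabular "regressor".

    transitions: list of (state, action, reward, next_state, done) tuples.
    actions: iterable of all discrete actions.
    Returns a dict mapping (state, action) to an estimated Q-value.
    """
    q = {}
    for _ in range(n_iters):
        sums = defaultdict(float)
        counts = defaultdict(int)
        for s, a, r, s_next, done in transitions:
            # Bootstrapped target uses the previous iteration's Q estimate.
            bootstrap = 0.0 if done else max(q.get((s_next, b), 0.0) for b in actions)
            sums[(s, a)] += r + gamma * bootstrap
            counts[(s, a)] += 1
        # "Fit" step: with a table, least-squares regression is the mean target.
        q = {key: sums[key] / counts[key] for key in sums}
    return q
```

Replacing the averaging step with any supervised regressor over state features gives the function-approximation form. The whole procedure reuses the same fixed batch at every iteration, which is why it can work with few trajectories.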
Unifying task specification in reinforcement learning
Reinforcement learning tasks are typically specified as Markov decision processes. This formalism has been highly successful, though specifications often couple the dynamics of the environment and the learning objective. This lack of modularity can complicate generalization of the task specification, as well as obfuscate connections between different task settings, such as episodic and continuing. In this work, we introduce the RL task formalism, which provides a unification through simple constructs including a generalization to transition-based discounting. Through a series of examples, we demonstrate the generality and utility of this formalism. Finally, we extend standard learning constructs, including Bellman operators, and extend some seminal theoretical results, including approximation error bounds. Overall, we provide a well-understood and sound formalism on which to build theoretical results and simplify algorithm use and development.
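To make transition-based discounting concrete, here is a minimal sketch in which the discount is a function of the whole transition (state, action, next state) rather than a constant. It assumes the return obeys the recursion G_t = R_{t+1} + γ(S_t, A_t, S_{t+1}) G_{t+1}. The names `discounted_return` and `discount_fn`, and the state labels in the usage example, are hypothetical.

```python
def discounted_return(trajectory, discount_fn):
    """Return from the first step of a trajectory under transition-based discounting.

    trajectory: list of (state, action, reward, next_state) tuples.
    discount_fn: maps (state, action, next_state) to a discount in [0, 1].
    """
    g = 0.0
    # Work backwards so each step applies its own transition's discount.
    for s, a, r, s_next in reversed(trajectory):
        g = r + discount_fn(s, a, s_next) * g
    return g


# Usage: an episodic task expressed inside one continuing stream of experience.
# Entering "goal" has discount 0, which cuts the return off at the episode boundary.
stream = [
    ("s0", "go", 1.0, "s1"),
    ("s1", "go", 1.0, "goal"),
    ("goal", "go", 5.0, "s0"),  # first step of the next episode
]
episodic = lambda s, a, s_next: 0.0 if s_next == "goal" else 0.9
continuing = lambda s, a, s_next: 0.9

episodic_return = discounted_return(stream, episodic)      # 1 + 0.9 * 1 = 1.9
continuing_return = discounted_return(stream, continuing)  # 1 + 0.9 * (1 + 0.9 * 5) = 5.95
```

The same environment stream yields either an episodic or a continuing objective depending only on the discount function, which is one way to read the abstract's point about separating dynamics from the learning objective.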
Industrial AI Podcast – Bonsai – Medium
In Part 3 of TWIML's Industrial AI series, Sam Charrington digs into robotics and reinforcement learning with Berkeley PhD student, Chelsea Finn. This talk gets into some of the technical weeds of cutting-edge robotics technologies, including inverse reinforcement learning, meta learning and the benefits and challenges of training robots in simulations. Chelsea also talks about what it's like pursuing a PhD in machine learning and how to keep up with such a rapidly advancing field.
Hashing Over Predicted Future Frames for Informed Exploration of Deep Reinforcement Learning
Yin, Haiyan, Pan, Sinno Jialin
In reinforcement learning (RL) tasks, an efficient exploration mechanism should encourage an agent to take actions that lead to less frequently visited states, which may yield higher cumulative future return. However, both knowing about the future and evaluating how frequently states have been visited are non-trivial tasks, especially in deep RL domains, where a state is represented by high-dimensional image frames. In this paper, we propose a novel informed exploration framework for deep RL tasks, in which we give an RL agent the capability to predict future transitions and to evaluate the visit frequency of the predicted future frames in a meaningful manner. To this end, we train a deep prediction model to generate future frames given a state-action pair, and a convolutional autoencoder model to generate deep features for hashing the seen frames. In addition, to use the counts derived from the seen frames to evaluate the visit frequency of the predicted frames, we tackle the challenge of making the hash codes of the predicted future frames match those of their corresponding seen frames. In this way, we can derive a reliable metric for evaluating the novelty of the future direction pointed to by each action, and hence inform the agent to explore the least frequent one. We use Atari 2600 games as the testing environment and demonstrate that the proposed framework achieves significant performance gain over a state-of-the-art informed exploration approach in most of the domains.
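The counting idea described above can be sketched as follows. Feature vectors are hashed to short binary codes, visits are counted per code, and the agent picks the action whose predicted future frames fall into the least visited codes. The sketch assumes a random-projection sign hash (SimHash) as a stand-in for the paper's learned hashing, and it takes feature vectors as plain lists rather than outputs of a prediction model and autoencoder. All class and function names are hypothetical.

```python
import random
from collections import Counter


class HashCounter:
    """Counts visits to feature vectors via a random-projection sign hash."""

    def __init__(self, feature_dim, n_bits=16, seed=0):
        rng = random.Random(seed)
        # One random hyperplane per bit of the hash code.
        self.planes = [
            [rng.gauss(0.0, 1.0) for _ in range(feature_dim)] for _ in range(n_bits)
        ]
        self.counts = Counter()

    def code(self, features):
        # Each bit records which side of a hyperplane the features fall on.
        return tuple(
            int(sum(w * x for w, x in zip(plane, features)) >= 0.0)
            for plane in self.planes
        )

    def observe(self, features):
        self.counts[self.code(features)] += 1

    def count(self, features):
        return self.counts[self.code(features)]


def least_visited_action(counter, predicted_features_by_action):
    """Pick the action whose predicted future features have the lowest total count.

    predicted_features_by_action: dict mapping each action to a list of
    feature vectors for its predicted future frames.
    """
    return min(
        predicted_features_by_action,
        key=lambda a: sum(counter.count(f) for f in predicted_features_by_action[a]),
    )
```

For the counts to be meaningful, a predicted frame and the real frame it corresponds to must hash to the same code. That matching problem is what the abstract identifies as a challenge, and this sketch does not address it.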
Variance Regularizing Adversarial Learning
Grewal, Karan, Hjelm, R Devon, Bengio, Yoshua
We introduce a novel approach for training adversarial models by replacing the discriminator score with a bi-modal Gaussian distribution over the real/fake indicator variables. In order to do this, we train the Gaussian classifier to match the target bi-modal distribution implicitly through meta-adversarial training. We hypothesize that this approach ensures a non-zero gradient to the generator, even in the limit of a perfect classifier. We test our method on standard benchmark image datasets and show that the classifier output distribution is smooth and has overlap between the real and fake modes.