Reinforcement Learning
Simultaneously Learning Vision and Feature-based Control Policies for Real-world Ball-in-a-Cup
Schwab, Devin, Springenberg, Tobias, Martins, Murilo F., Lampe, Thomas, Neunert, Michael, Abdolmaleki, Abbas, Hertweck, Tim, Hafner, Roland, Nori, Francesco, Riedmiller, Martin
We present a method for fast training of vision-based control policies on real robots. The key idea behind our method is to perform multi-task Reinforcement Learning with auxiliary tasks that differ not only in the reward to be optimized but also in the state-space in which they operate. In particular, we allow auxiliary task policies to utilize task features that are available only at training time. This allows for fast learning of auxiliary policies, which subsequently generate good data for training the main, vision-based control policies. This method can be seen as an extension of the Scheduled Auxiliary Control (SAC-X) framework. We demonstrate the efficacy of our method using both a simulated and a real-world Ball-in-a-Cup game controlled by a robot arm. In simulation, our approach leads to significant learning speed-ups when compared to standard SAC-X. On the real robot we show that the task can be learned from scratch, i.e., with no transfer from simulation and no imitation learning. Videos of our learned policies running on the real robot can be found at https://sites.google.com/view/rss-2019-sawyer-bic/.
Message-Dropout: An Efficient Training Method for Multi-Agent Deep Reinforcement Learning
Kim, Woojun, Cho, Myungsik, Sung, Youngchul
In this paper, we propose a new learning technique named message-dropout to improve the performance of multi-agent deep reinforcement learning under two application scenarios: 1) classical multi-agent reinforcement learning with direct message communication among agents and 2) centralized training with decentralized execution. In the first scenario, in which direct message communication among agents is allowed, the message-dropout technique drops out the messages received from other agents in a block-wise manner with a certain probability in the training phase and compensates for this effect by multiplying the weights of the dropped-out block units with a correction probability. The applied message-dropout technique effectively handles the increased input dimension in multi-agent reinforcement learning with communication and makes learning robust against communication errors in the execution phase. In the second scenario, centralized training with decentralized execution, we consider in particular the application of the proposed message-dropout to Multi-Agent Deep Deterministic Policy Gradient (MADDPG), which uses a centralized critic to train a decentralized actor for each agent. We evaluate the proposed message-dropout technique on several games, and numerical results show that, with a proper dropout rate, it significantly improves reinforcement learning performance in terms of both the training speed and the steady-state performance in the execution phase.
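The block-wise dropping described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `message_dropout`, its parameters, and the choice to apply the compensation by scaling the message inputs by the keep probability at execution time (which, for a linear first layer, is equivalent to scaling the weights attached to those inputs) are all hypothetical.

```python
import numpy as np

def message_dropout(own_obs, messages, drop_prob, rng, training=True):
    """Build an agent's input from its own observation plus received messages.

    Hypothetical sketch: each message (one block per sending agent) is
    dropped as a whole with probability `drop_prob` during training.
    At execution time nothing is dropped; the message blocks are scaled
    by the keep probability instead (an assumed form of the correction).
    """
    if training:
        # One Bernoulli draw per message block, not per element.
        keep = rng.random(len(messages)) >= drop_prob
        blocks = [m if k else np.zeros_like(m) for m, k in zip(messages, keep)]
    else:
        blocks = [(1.0 - drop_prob) * m for m in messages]
    return np.concatenate([own_obs, *blocks])
```

The agent's own observation is never dropped; only whole message blocks from other agents are zeroed, which is what distinguishes this from element-wise dropout.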
Parenting: Safe Reinforcement Learning from Human Input
Frye, Christopher, Feige, Ilya
Autonomous agents trained via reinforcement learning present numerous safety concerns: reward hacking, negative side effects, and unsafe exploration, among others. In the context of near-future autonomous agents, operating in environments where humans understand the existing dangers, human involvement in the learning process has proved a promising approach to AI Safety. Here we demonstrate that a precise framework for learning from human input, loosely inspired by the way humans parent children, solves a broad class of safety problems in this context. We show that our PARENTING algorithm solves these problems in the relevant AI Safety gridworlds of Leike et al. (2017), that an agent can learn to outperform its parent as it "matures", and that policies learnt through PARENTING are generalisable to new environments.
A* Tree Search for Portfolio Management
Gao, Xiaojie, Tu, Shikui, Xu, Lei
We propose a planning-based method to teach an agent to manage a portfolio from scratch. Our approach combines deep reinforcement learning techniques with search techniques, as AlphaGo does. By uniting the advantages of the A* search algorithm with Monte Carlo tree search, we arrive at a new algorithm named A* tree search, in which the best information is returned to guide the next search. In addition, the expansion mode of the Monte Carlo tree is improved for a higher utilization of the neural network. The suggested algorithm can also optimize a non-differentiable utility function by combinatorial search. This technique is then used in our trading system. The major component is a neural network that is trained on trading experiences from tree search and, in turn, outputs prior probabilities that guide the search by pruning away branches. Experimental results on simulated and real financial data verify the robustness of the proposed trading system, which produces better strategies than several approaches based on reinforcement learning.
A new Potential-Based Reward Shaping for Reinforcement Learning Agent
Badnava, Babak, Mozayani, Nasser
Potential-based reward shaping (PBRS) is a particular category of machine learning methods which aims to improve the learning speed of a reinforcement learning agent by extracting and utilizing extra knowledge while performing a task. There are two steps in the process of transfer learning: extracting knowledge from previously learned tasks and transferring that knowledge to use it in a target task. The latter step is well discussed in the literature, with various methods proposed for it, while the former has been explored less. With this in mind, the type of knowledge that is transmitted is very important and can lead to considerable improvement. In the literature on both transfer learning and potential-based reward shaping, a subject that has never been addressed is the knowledge gathered during the learning process itself. In this paper, we present a novel potential-based reward shaping method that attempts to extract knowledge from the learning process. The proposed method extracts knowledge from episodes' cumulative rewards. It has been evaluated in the Arcade Learning Environment, and the results indicate an improvement in the learning process for both single-task and multi-task reinforcement learning agents.
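For readers unfamiliar with the general technique, the standard form of potential-based shaping adds the term gamma * Phi(s') - Phi(s) to the environment reward. The sketch below shows only that generic form; it does not reproduce how the paper derives its potential from episodes' cumulative rewards. The function name `shaped_reward` and its parameters are hypothetical, and treating the potential of a terminal state as zero is an assumption of this sketch.

```python
def shaped_reward(reward, phi_s, phi_next, gamma=0.99, done=False):
    """Generic potential-based shaping: r + gamma * Phi(s') - Phi(s).

    `phi_s` and `phi_next` are the potential values of the current and
    next state, supplied by the caller (how they are computed is left
    open here). The terminal-state potential is assumed to be zero.
    """
    next_potential = 0.0 if done else phi_next
    return reward + gamma * next_potential - phi_s
```

A call such as `shaped_reward(r, phi(s), phi(s_next), gamma, done)` would replace the raw reward `r` in the agent's update, with `phi` being whatever potential function the method defines.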
SURREAL
Our goal is to make Deep Reinforcement Learning accessible to everyone. We introduce Surreal, an open-source, reproducible, and scalable distributed reinforcement learning framework. Surreal provides a high-level abstraction for building distributed reinforcement learning algorithms. We implement our distributed variants of PPO and DDPG in the current release.
Communication Topologies Between Learning Agents in Deep Reinforcement Learning
Adjodah, Dhaval, Calacci, Dan, Dubey, Abhimanyu, Goyal, Anirudh, Krafft, Peter, Moro, Esteban, Pentland, Alex
A common technique to improve speed and robustness of learning in deep reinforcement learning (DRL) and many other machine learning algorithms is to run multiple learning agents in parallel. A neglected component in the development of these algorithms has been how best to arrange the learning agents involved to better facilitate distributed search. Here we draw upon results from the networked optimization and collective intelligence literatures suggesting that arranging learning agents in less than fully connected topologies (the implicit way agents are commonly arranged in) can improve learning. We explore the relative performance of four popular families of graphs and observe that one such family (Erdos-Renyi random graphs) empirically outperforms the standard fully-connected communication topology across several DRL benchmark tasks. We observe that 1000 learning agents arranged in an Erdos-Renyi graph can perform as well as 3000 agents arranged in the standard fully-connected topology, showing the large learning improvement possible when carefully designing the topology over which agents communicate. We complement these empirical results with a preliminary theoretical investigation of why less than fully connected topologies can perform better. Overall, our work suggests that distributed machine learning algorithms could be made more efficient if the communication topology between learning agents was optimized.
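To make the contrast between topologies concrete, the following minimal sketch builds an Erdos-Renyi communication graph over learning agents, in which each pair of agents is linked independently with a fixed probability. It illustrates the graph family only and is not the authors' code; the function name `erdos_renyi_neighbors` and its parameters are hypothetical.

```python
import random
from itertools import combinations

def erdos_renyi_neighbors(n_agents, edge_prob, seed=None):
    """Return {agent: set of neighbours} for a G(n, p) random graph.

    Every unordered pair of agents is connected independently with
    probability `edge_prob`. With edge_prob=1.0 this reduces to the
    fully-connected topology.
    """
    rng = random.Random(seed)
    neighbors = {i: set() for i in range(n_agents)}
    for i, j in combinations(range(n_agents), 2):
        if rng.random() < edge_prob:
            # Communication links are undirected, so add both directions.
            neighbors[i].add(j)
            neighbors[j].add(i)
    return neighbors
```

Under this sketch, an agent would exchange information only with the agents in its neighbour set rather than with all other agents.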
Competitive Experience Replay
Liu, Hao, Trott, Alexander, Socher, Richard, Xiong, Caiming
Deep learning has achieved remarkable successes in solving challenging reinforcement learning (RL) problems when a dense reward function is provided. However, in sparse-reward environments it still often suffers from the need to carefully shape the reward function to guide policy optimization. This limits the applicability of RL in the real world, since both reinforcement learning and domain-specific knowledge are required. It is therefore of great practical importance to develop algorithms which can learn from a binary signal indicating successful task completion or from other unshaped, sparse reward signals. We propose a novel method called competitive experience replay, which efficiently supplements a sparse reward by placing learning in the context of an exploration competition between a pair of agents. Our method complements the recently proposed hindsight experience replay (HER) by inducing an automatic exploratory curriculum. We evaluate our approach on the tasks of reaching various goal locations in an ant maze and manipulating objects with a robotic arm. Each task provides only binary rewards indicating whether or not the goal is achieved. Our method asymmetrically augments these sparse rewards for a pair of agents each learning the same task, creating a competitive game designed to drive exploration. Extensive experiments demonstrate that this method leads to faster convergence and improved task performance.
Google open-sources PlaNet, an AI agent that learns about the world from images
But it's not always practical; model-free approaches, which aim to get agents to directly predict actions from observations about their world, can take weeks of training. Model-based reinforcement learning is a viable alternative -- it has agents come up with a general model of their environment they can use to plan ahead. But in order to accurately forecast actions in unfamiliar surroundings, those agents have to formulate rules from experience. Toward that end, Google in collaboration with DeepMind today introduced the Deep Planning Network (PlaNet) agent, which learns a world model from image inputs and leverages it for planning. It's able to solve a variety of image-based tasks with up to 5,000 percent the data efficiency, Google says, while maintaining competitiveness with advanced model-free agents.
Mobile AI Through Machine Learning Algorithms
Machine learning (ML) is a method of artificial intelligence (AI) in which data is used to train a machine so that it can make decisions or predictions on its own. In a previous blog, Setting up your machine learning projects for success, we discussed how data and modelling play a key role in allowing a machine to learn and improve, and our e-book explains how ML fits into the bigger picture of AI. An ML algorithm is a key element that ties this all together, and as we'll discover in this blog, there are four main categories of ML algorithms – supervised machine learning, unsupervised machine learning, semi-supervised machine learning, and reinforcement machine learning. Many of today's ML algorithms can be considered supervised, which means the model is iteratively trained by running the algorithm and comparing its output against data that is known to be correct. Once training is complete, the algorithm and model are ready for inference.