Reinforcement Learning
A technique to improve machine learning inspired by the behavior of human infants
From their first years of life, human beings have the innate ability to learn continuously and build mental models of the world, simply by observing and interacting with things or people in their surroundings. Cognitive psychology studies suggest that humans make extensive use of this previously acquired knowledge, particularly when they encounter new situations or when making decisions. Despite the significant recent advances in the field of artificial intelligence (AI), most virtual agents still require hundreds of hours of training to achieve human-level performance in several tasks, while humans can learn how to complete these tasks in a few hours or less. Recent studies have highlighted two key contributors to humans' ability to acquire knowledge so quickly--namely, intuitive physics and intuitive psychology. These intuition models, which have been observed in humans from early stages of development, might be the core facilitators of future learning.
An Actor-Critic-Attention Mechanism for Deep Reinforcement Learning in Multi-view Environments
In reinforcement learning algorithms, leveraging multiple views of the environment can improve the learning of complicated policies. In multi-view environments, due to the fact that the views may frequently suffer from partial observability, their level of importance are often different. In this paper, we propose a deep reinforcement learning method and an attention mechanism in a multi-view environment. Each view can provide various representative information about the environment. Through our attention mechanism, our method generates a single feature representation of environment given its multiple views. It learns a policy to dynamically attend to each view based on its importance in the decision-making process. Through experiments, we show that our method outperforms its state-of-the-art baselines on TORCS racing car simulator and three other complex 3D environments with obstacles. We also provide experimental results to evaluate the performance of our method on noisy conditions and partial observation settings.
GPU-Accelerated Atari Emulation for Reinforcement Learning
Dalton, Steven, Frosio, Iuri, Garland, Michael
We designed and implemented a CUDA port of the Atari Learning Environment (ALE), a system for developing and evaluating deep reinforcement algorithms using Atari games. Our CUDA Learning Environment (CuLE) overcomes many limitations of existing CPUbased Atari emulators and scales naturally to multi-GPU systems. It leverages the parallelization capability of GPUs to run thousands of Atari games simultaneously; by rendering frames directly on the GPU, CuLE avoids the bottleneck arising from the limited CPU-GPU communication bandwidth. Figure 1: In a typical DRL system, environments run As a result, CuLE is able to generate between 40M on CPUs, whereas GPUs execute DNN operations. The and 190M frames per hour using a single GPU, a finding limited CPU-GPU communication bandwidth and small that could be previously achieved only through a cluster set of CPU environments prevent full GPU utilization. of CPUs. We demonstrate the advantages of CuLE by effectively training agents with traditional deep reinforcement learning algorithms and measuring the utilization benchmark for DRL [4, 14], and still represent a challenging and throughput of the GPU. Our analysis further highlights set for the development of new DRL methods.
Delegative Reinforcement Learning: learning to avoid traps with a little help
Most known regret bounds for reinforcement learning are either episodic or assume an environment without traps. We derive a regret bound without making either assumption, by allowing the algorithm to occasionally delegate an action to an external advisor. We thus arrive at a setting of active one-shot model-based reinforcement learning that we call DRL (delegative reinforcement learning.) The algorithm we construct in order to demonstrate the regret bound is a variant of Posterior Sampling Reinforcement Learning supplemented by a subroutine that decides which actions should be delegated. The algorithm is not anytime, since the parameters must be adjusted according to the target time discount. Currently, our analysis is limited to Markov decision processes with finite numbers of hypotheses, states and actions.
Prioritized Guidance for Efficient Multi-Agent Reinforcement Learning Exploration
Exploration efficiency is a challenging problem in multi-agent reinforcement learning (MARL), as the policy learned by confederate MARL depends on the collaborative approach among multiple agents. Another important problem is the less informative reward restricts the learning speed of MARL compared with the informative label in supervised learning. In this work, we leverage on a novel communication method to guide MARL to accelerate exploration and propose a predictive network to forecast the reward of current state-action pair and use the guidance learned by the predictive network to modify the reward function. An improved prioritized experience replay is employed to better take advantage of the different knowledge learned by different agents which utilizes Time-difference (TD) error more effectively. Experimental results demonstrates that the proposed algorithm outperforms existing methods in cooperative multi-agent environments. We remark that this algorithm can be extended to supervised learning to speed up its training.
Combinatorial Keyword Recommendations for Sponsored Search with Deep Reinforcement Learning
Li, Zhipeng, Wu, Jianwei, Sun, Lin, Rong, Tao
In sponsored search, keyword recommendations help advertisers to achieve much better performance within limited budget. Many works have been done to mine numerous candidate keywords from search logs or landing pages. However, the strategy to select from given candidates remains to be improved. The existing relevance-based, popularity-based and regular combinatorial strategies fail to take the internal or external competitions among keywords into consideration. In this paper, we regard keyword recommendations as a combinatorial optimization problem and solve it with a modified pointer network structure. The model is trained on an actor-critic based deep reinforcement learning framework. A pre-clustering method called Equal Size K-Means is proposed to accelerate the training and testing procedure on the framework by reducing the action space. The performance of framework is evaluated both in offline and online environments, and remarkable improvements can be observed.
An intelligent financial portfolio trading strategy using deep Q-learning
Park, Hyungjun, Sim, Min Kyu, Choi, Dong Gu
A goal of financial portfolio trading is maximizing the trader's utility by allocating capital to assets in a portfolio in the investment horizon. Our study suggests an approach for deriving an intelligent portfolio trading strategy using deep Q-learning. In this approach, we introduce a Markov decision process model to enable an agent to learn about the financial environment and develop a deep neural network structure to approximate a Q-function. In addition, we devise three techniques to derive a trading strategy that chooses reasonable actions and is applicable to the real world. First, the action space of the learning agent is modeled as an intuitive set of trading directions that can be carried out for individual assets in the portfolio. Second, we introduce a mapping function that can replace an infeasible agent action in each state with a similar and valuable action to derive a reasonable trading strategy. Last, we introduce a method by which an agent simulates all feasible actions and learns about these experiences to utilize the training data efficiently. To validate our approach, we conduct backtests for two representative portfolios, and we find that the intelligent strategy derived using our approach is superior to the benchmark strategies.
Dynamical Distance Learning for Unsupervised and Semi-Supervised Skill Discovery
Hartikainen, Kristian, Geng, Xinyang, Haarnoja, Tuomas, Levine, Sergey
Reinforcement learning requires manual specification of a reward function to learn a task. While in principle this reward function only needs to specify the task goal, in practice reinforcement learning can be very time-consuming or even infeasible unless the reward function is shaped so as to provide a smooth gradient towards a successful outcome. This shaping is difficult to specify by hand, particularly when the task is learned from raw observations, such as images. In this paper, we study how we can automatically learn dynamical distances: a measure of the expected number of time steps to reach a given goal state from any other state. These dynamical distances can be used to provide well-shaped reward functions for reaching new goals, making it possible to learn complex tasks efficiently. We also show that dynamical distances can be used in a semi-supervised regime, where unsupervised interaction with the environment is used to learn the dynamical distances, while a small amount of preference supervision is used to determine the task goal, without any manually engineered reward function or goal examples. We evaluate our method both in simulation and on a real-world robot. We show that our method can learn locomotion skills in simulation without any supervision. We also show that it can learn to turn a valve with a real-world 9-DoF hand, using raw image observations and ten preference labels, without any other supervision. Videos of the learned skills can be found on the project website: https://sites.google.com/view/skills-via-distance-learning.
Credit Assignment as a Proxy for Transfer in Reinforcement Learning
Ferret, Johan, Marinier, Raphaël, Geist, Matthieu, Pietquin, Olivier
The ability to transfer representations to novel environments and tasks is a sensible requirement for general learning agents. Despite the apparent promises, transfer in Reinforcement Learning is still an open and under-exploited research area. In this paper, we suggest that credit assignment, regarded as a supervised learning task, could be used to accomplish transfer. Our contribution is twofold: we introduce a new credit assignment mechanism based on self-attention, and show that the learned credit can be transferred to in-domain and out-of-domain scenarios.
Transfer Learning Across Simulated Robots With Different Sensors
Plisnier, Hélène, Steckelmacher, Denis, Roijers, Diederik, Nowé, Ann
For a robot to learn a good policy, it often requires expensive equipment (such as sophisticated sensors) and a prepared training environment conducive to learning. However, it is seldom possible to perfectly equip robots for economic reasons, nor to guarantee ideal learning conditions, when deployed in real-life environments. A solution would be to prepare the robot in the lab environment, when all necessary material is available to learn a good policy. After training in the lab, the robot should be able to get by without the expensive equipment that used to be available to it, and yet still be guaranteed to perform well on the field. The transition between the lab (source) and the real-world environment (target) is related to transfer learning, where the state-space between the source and target tasks differ. We tackle a simulated task with continuous states and discrete actions presenting this challenge, using Bootstrapped Dual Policy Iteration, a model-free actor-critic reinforcement learning algorithm, and Policy Shaping. Specifically, we train a BDPI agent, embodied by a virtual robot performing a task in the V-Rep simulator, sensing its environment through several proximity sensors. The resulting policy is then used by a second agent learning the same task in the same environment, but with camera images as input. The goal is to obtain a policy able to perform the task relying on merely camera images.