Goto

Collaborating Authors

 Reinforcement Learning


Information-Directed Exploration for Deep Reinforcement Learning

arXiv.org Artificial Intelligence

Efficient exploration remains a major challenge for reinforcement learning. One reason is that the variability of the returns often depends on the current state and action, and is therefore heteroscedastic. Classical exploration strategies such as upper confidence bound algorithms and Thompson sampling fail to appropriately account for heteroscedasticity, even in the bandit setting. Motivated by recent findings that address this issue in bandits, we propose to use Information-Directed Sampling (IDS) for exploration in reinforcement learning. As our main contribution, we build on recent advances in distributional reinforcement learning and propose a novel, tractable approximation of IDS for deep Q-learning. The resulting exploration strategy explicitly accounts for both parametric uncertainty and heteroscedastic observation noise. We evaluate our method on Atari games and demonstrate a significant improvement over alternative approaches.


On Improving Decentralized Hysteretic Deep Reinforcement Learning

arXiv.org Machine Learning

Recent successes of value-based multi-agent deep reinforcement learning employ optimism by limiting underestimation updates of value function estimator, through carefully controlled learning rate (Omidshafiei et al., 2017) or reduced update probability (Palmer et al., 2018). To achieve full cooperation when learning independently, an agent must estimate the state values contingent on having optimal teammates; therefore, value overestimation is frequency injected to counteract negative effects caused by unobservable teammate sub-optimal policies and explorations. Aiming to solve this issue through automatic scheduling, this paper introduces a decentralized quantile estimator, which we found empirically to be more stable, sample efficient and more likely to converge to the joint optimal policy.


PPO-CMA: Proximal Policy Optimization with Covariance Matrix Adaptation

arXiv.org Machine Learning

Proximal Policy Optimization (PPO) is a highly popular model-free reinforcement learning (RL) approach. However, in continuous state and actions spaces and a Gaussian policy -- common in computer animation and robotics -- PPO is prone to getting stuck in local optima. In this paper, we observe a tendency of PPO to prematurely shrink the exploration variance, which naturally leads to slow progress. Motivated by this, we borrow ideas from CMA-ES, a black-box optimization method designed for intelligent adaptive Gaussian exploration, to derive PPO-CMA, a novel proximal policy optimization approach that can expand the exploration variance on objective function slopes and shrink the variance when close to the optimum. This is implemented by using separate neural networks for policy mean and variance and training the mean and variance in separate passes. Our experiments demonstrate a clear improvement over vanilla PPO in many difficult OpenAI Gym MuJoCo tasks.


Universal Successor Features Approximators

arXiv.org Artificial Intelligence

The ability of a reinforcement learning (RL) agent to learn about many reward functions at the same time has many potential benefits, such as the decomposition of complex tasks into simpler ones, the exchange of information between tasks, and the reuse of skills. We focus on one aspect in particular, namely the ability to generalise to unseen tasks. Parametric generalisation relies on the interpolation power of a function approximator that is given the task description as input; one of its most common form are universal value function approximators (UVFAs). Another way to generalise to new tasks is to exploit structure in the RL problem itself. Generalised policy improvement (GPI) combines solutions of previous tasks into a policy for the unseen task; this relies on instantaneous policy evaluation of old policies under the new reward function, which is made possible through successor features (SFs). Our proposed universal successor features approximators (USFAs) combine the advantages of all of these, namely the scalability of UVFAs, the instant inference of SFs, and the strong generalisation of GPI. We discuss the challenges involved in training a USFA, its generalisation properties and demonstrate its practical benefits and transfer abilities on a large-scale domain in which the agent has to navigate in a first-person perspective three-dimensional environment.


Domain Adaptation for Reinforcement Learning on the Atari

arXiv.org Artificial Intelligence

Deep reinforcement learning agents have recently been successful across a variety of discrete and continuous control tasks; however, they can be slow to train and require a large number of interactions with the environment to learn a suitable policy. This is borne out by the fact that a reinforcement learning agent has no prior knowledge of the world, no pre-existing data to depend on and so must devote considerable time to exploration. Transfer learning can alleviate some of the problems by leveraging learning done on some source task to help learning on some target task. Our work presents an algorithm for initialising the hidden feature representation of the target task. We propose a domain adaptation method to transfer state representations and demonstrate transfer across domains, tasks and action spaces. We utilise adversarial domain adaptation ideas combined with an adversarial autoencoder architecture. We align our new policies' representation space with a pre-trained source policy, taking target task data generated from a random policy. We demonstrate that this initialisation step provides significant improvement when learning a new reinforcement learning task, which highlights the wide applicability of adversarial adaptation methods; even as the task and label/action space also changes.


Continual Match Based Training in Pommerman: Technical Report

arXiv.org Artificial Intelligence

Continual learning is the ability of agents to improve their capacities throughout multiple tasks continually. While recent works in the literature of continual learning mostly focused on developing either particular loss functions or specialized structures of neural network explaining the episodic memory or neural plasticity, we study continual learning from the perspective of the training mechanism. Specifically, we propose a COnitnual Match BAsed Training (COMBAT) framework for training a population of advantage-actor-critic (A2C) agents in Pommerman, a partially observable multi-agent environment with no communication. Following the COMBAT framework, we trained an agent, namely, Navocado, that won the title of the top 1 learning agent in the NeurIPS 2018 Pommerman Competition. Two critical features of our agent are worth mentioning. Firstly, our agent did not learn from any demonstrations. Secondly, our agent is highly reproducible. As a technical report, we articulate the design of state space, action space, reward, and most importantly, the COMBAT framework for our Pommerman agent. We show in the experiments that Pommerman is a perfect environment for studying continual learning, and the agent can improve its performance by continually learning new skills without forgetting the old ones. Finally, the result in the Pommerman Competition verifies the robustness of our agent when competing with various opponents.


Decentralized Computation Offloading for Multi-User Mobile Edge Computing: A Deep Reinforcement Learning Approach

arXiv.org Machine Learning

Mobile edge computing (MEC) emerges recently as a promising solution to relieve resource-limited mobile devices from computation-intensive tasks, which enables devices to offload workloads to nearby MEC servers and improve the quality of computation experience. Nevertheless, by considering an MEC system consisting of multiple mobile users with stochastic task arrivals and wireless channels in this paper, the design of computation offloading policies is challenging to minimize the long-term average computation cost in terms of power consumption and buffering delay. A deep reinforcement learning (DRL) based decentralized dynamic computation offloading strategy is investigated to build a scalable MEC system with limited feedback. Specifically, a continuous action space based DRL approach named deep deterministic policy gradient (DDPG) is adopted to learn efficient computation offloading policies independently at each mobile user. Thus, powers of both local execution and task offloading can be adaptively allocated by the learned policies from each user's local observation of the MEC system. Numerical results are illustrated to demonstrate that efficient policies can be learned at each user, and performance of the proposed DDPG based decentralized strategy outperforms the conventional deep Q-network (DQN) based discrete power control strategy and some other greedy strategies with reduced computation cost. Besides, the power-delay tradeoff is also analyzed for both the DDPG based and DQN based strategies.


An Active Information Seeking Model for Goal-oriented Vision-and-Language Tasks

arXiv.org Machine Learning

As Computer Vision algorithms move from passive analysis of pixels to active reasoning over semantics, the breadth of information algorithms need to reason over has expanded significantly. One of the key challenges in this vein is the ability to identify the information required to make a decision, and select an action that will recover this information. We propose an reinforcement-learning approach that maintains an distribution over its internal information, thus explicitly representing the ambiguity in what it knows, and needs to know, towards achieving its goal. Potential actions are then generated according to particles sampled from this distribution. For each potential action a distribution of the expected answers is calculated, and the value of the information gained is obtained, as compared to the existing internal information. We demonstrate this approach applied to two vision-language problems that have attracted significant recent interest, visual dialogue and visual query generation. In both cases the method actively selects actions that will best reduce its internal uncertainty, and outperforms its competitors in achieving the goal of the challenge.


Double Deep Q-Learning for Optimal Execution

arXiv.org Machine Learning

Optimal trade execution is an important problem faced by essentially all traders. Much research into optimal execution uses stringent model assumptions and applies continuous time stochastic control to solve them. Here, we instead take a model free approach and develop a variation of Deep Q-Learning to estimate the optimal actions of a trader. The model is a fully connected Neural Network trained using Experience Replay and Double DQN with input features given by the current state of the limit order book, other trading signals, and available execution actions, while the output is the Q-value function estimating the future rewards under an arbitrary action. We apply our model to nine different stocks and find that it outperforms the standard benchmark approach on most stocks using the measures of (i) mean and median out-performance, (ii) probability of out-performance, and (iii) gain-loss ratios.


A Logarithmic Barrier Method For Proximal Policy Optimization

arXiv.org Machine Learning

Proximal policy optimization(PPO) [ Schulman et al., 2017 ] has been proposed as a first-order optimization method for reinforcement learning. We should notice that an exterior penalty method is used in it. Often, the minimizers of the exterior penalty functions approach feasibility only in the limits as the penalty parameter grows increasingly large. Therefore, it may result in the low level of sampling efficiency. This method, which we call proximal policy optimization with barrier method (PPO-B), keeps almost all advantageous spheres of PPO such as easy implementation and good generalization. Specifically, a new surrogate objective with interior penalty method is proposed to avoid the defect arose from exterior penalty method. Conclusions can be draw that PPO-B is able to outperform PPO in terms of sampling efficiency since PPO-B achieved clearly better performance on Atari and Mujoco environment than PPO.