Goto

Collaborating Authors

 Reinforcement Learning


Value-Decomposition Multi-Agent Actor-Critics

arXiv.org Artificial Intelligence

The exploitation of extra state information has been an active research area in multi-agent reinforcement learning (MARL). QMIX represents the joint action-value using a non-negative function approximator and achieves the best performance, by far, on multi-agent benchmarks, StarCraft II micromanagement tasks. However, our experiments show that, in some cases, QMIX is incompatible with A2C, a training paradigm that promotes algorithm training efficiency. To obtain a reasonable trade-off between training efficiency and algorithm performance, we extend value-decomposition to actor-critics that are compatible with A2C and propose a novel actor-critic framework, value-decomposition actor-critics (VDACs). We evaluate VDACs on the testbed of StarCraft II micromanagement tasks and demonstrate that the proposed framework improves median performance over other actor-critic methods. Furthermore, we use a set of ablation experiments to identify the key factors that contribute to the performance of VDACs.


Towards Efficient Connected and Automated Driving System via Multi-agent Graph Reinforcement Learning

arXiv.org Machine Learning

Connected and automated vehicles (CAVs) have attracted more and more attention recently. The fast actuation time allows them having the potential to promote the efficiency and safety of the whole transportation system. Due to technical challenges, there will be a proportion of vehicles that can be equipped with automation while other vehicles are without automation. Instead of learning a reliable behavior for ego automated vehicle, we focus on how to improve the outcomes of the total transportation system by allowing each automated vehicle to learn cooperation with each other and regulate human-driven traffic flow. One of state of the art method is using reinforcement learning to learn intelligent decision making policy. However, direct reinforcement learning framework cannot improve the performance of the whole system. In this article, we demonstrate that considering the problem in multi-agent setting with shared policy can help achieve better system performance than non-shared policy in single-agent setting. Furthermore, we find that utilization of attention mechanism on interaction features can capture the interplay between each agent in order to boost cooperation. To the best of our knowledge, while previous automated driving studies mainly focus on enhancing individual's driving performance, this work serves as a starting point for research on system-level multi-agent cooperation performance using graph information sharing. We conduct extensive experiments in car-following and unsignalized intersection settings. The results demonstrate that CAVs controlled by our method can achieve the best performance against several state of the art baselines.


Quantile Reinforcement Learning

#artificialintelligence

I've been exploring reinforcement learning that takes advantage of uncertainty. In particular, I have implemented a basic version of QR-DQN-1 from Distributional Reinforcement Learning with Quantile Regression. Doing so required filling in some practical details from the paper, which I'm going to explain here. The approach is an extension of Deep Q-learning, which involves attempting to learn the value of being in a given state and taking an action to maximize this value (for more background, see this post). We think of the value of being in a state as a random variable drawn from some unknown distribution.


Maintaining the Illusion of Reality: Transfer in RL by Keeping Agents in the DARC

#artificialintelligence

Reinforcement learning (RL) is often touted as a promising approach for costly and risk-sensitive applications, yet practicing and learning in those domains directly is expensive. It costs time (e.g., OpenAI's Dota2 project used 10,000 years of experience), it costs money (e.g., "inexpensive" robotic arms used in research typically cost $10,000 to $30,000), and it could even be dangerous to humans. How can an intelligent agent learn to solve tasks in environments in which it cannot practice? For many tasks, such as assistive robotics and self-driving cars, we may have access to a different practice area, which we will call the source domain. While the source domain has different dynamics than the target domain, experience in the source domain is much cheaper to collect.


Decentralized Reinforcement Learning

#artificialintelligence

Many associations in the world like the biological ecosystems, government and corporations are physically decentralized however they are unified in the sense of their functionality. For instance, a financial institution operates with a global policy of maximizing their profits, hence appearing as a single entity; however, this entity abstraction is an illusion, as a financial institution is composed of a group of individual human agents solving their optimization problems with our without collaboration. The policy function parameters are fine-tuned depending on the gradients of the defined objective function. This approach is called the monolithic decision-making framework as the policy function's learning parameters are coupled globally solely using an objective function. Having covered a brief background of a centralized reinforcement learning framework, let us move forward to some promising decentralized reinforcement learning frameworks.


IntelligentPooling: Practical Thompson Sampling for mHealth

arXiv.org Machine Learning

In mobile health (mHealth) smart devices deliver behavioral treatments repeatedly over time to a user with the goal of helping the user adopt and maintain healthy behaviors. Reinforcement learning appears ideal for learning how to optimally make these sequential treatment decisions. However, significant challenges must be overcome before reinforcement learning can be effectively deployed in a mobile healthcare setting. In this work we are concerned with the following challenges: 1) individuals who are in the same context can exhibit differential response to treatments 2) only a limited amount of data is available for learning on any one individual, and 3) non-stationary responses to treatment. To address these challenges we generalize Thompson-Sampling bandit algorithms to develop IntelligentPooling. IntelligentPooling learns personalized treatment policies thus addressing challenge one. To address the second challenge, IntelligentPooling updates each user's degree of personalization while making use of available data on other users to speed up learning. Lastly, IntelligentPooling allows responsivity to vary as a function of a user's time since beginning treatment, thus addressing challenge three. We show that IntelligentPooling achieves an average of 26% lower regret than state-of-the-art. We demonstrate the promise of this approach and its ability to learn from even a small group of users in a live clinical trial.


Deep Reinforcement Learning using Cyclical Learning Rates

arXiv.org Machine Learning

Deep Reinforcement Learning (DRL) methods often rely on the meticulous tuning of hyperparameters to successfully resolve problems. One of the most influential parameters in optimization procedures based on stochastic gradient descent (SGD) is the learning rate. We investigate cyclical learning and propose a method for defining a general cyclical learning rate for various DRL problems. In this paper we present a method for cyclical learning applied to complex DRL problems. Our experiments show that, utilizing cyclical learning achieves similar or even better results than highly tuned fixed learning rates. This paper presents the first application of cyclical learning rates in DRL settings and is a step towards overcoming manual hyperparameter tuning.


Never Stop Learning: The Effectiveness of Fine-Tuning in Robotic Reinforcement Learning

arXiv.org Machine Learning

One of the great promises of robot learning systems is that they will be able to learn from their mistakes and continuously adapt to ever-changing environments. Despite this potential, most of the robot learning systems today are deployed as a fixed policy and they are not being adapted after their deployment. Can we efficiently adapt previously learned behaviors to new environments, objects and percepts in the real world? In this paper, we present a method and empirical evidence towards a robot learning framework that facilitates continuous adaption. In particular, we demonstrate how to adapt vision-based robotic manipulation policies to new variations by fine-tuning via off-policy reinforcement learning, including changes in background, object shape and appearance, lighting conditions, and robot morphology. Further, this adaptation uses less than 0.2% of the data necessary to learn the task from scratch. We find that our approach of adapting pre-trained policies leads to substantial performance gains over the course of fine-tuning, and that pre-training via RL is essential: training from scratch or adapting from supervised ImageNet features are both unsuccessful with such small amounts of data. We also find that these positive results hold in a limited continual learning setting, in which we repeatedly fine-tune a single lineage of policies using data from a succession of new tasks. Our empirical conclusions are consistently supported by experiments on simulated manipulation tasks, and by 52 unique fine-tuning experiments on a real robotic grasping system pre-trained on 580,000 grasps.


Quantity vs. Quality: On Hyperparameter Optimization for Deep Reinforcement Learning

arXiv.org Artificial Intelligence

Reinforcement learning algorithms can show strong variation in performance between training runs with different random seeds. In this paper we explore how this affects hyperparameter optimization when the goal is to find hyperparameter settings that perform well across random seeds. In particular, we benchmark whether it is better to explore a large quantity of hyperparameter settings via pruning of bad performers, or if it is better to aim for quality of collected results by using repetitions. For this we consider the Successive Halving, Random Search, and Bayesian Optimization algorithms, the latter two with and without repetitions. We apply these to tuning the PPO2 algorithm on the Cartpole balancing task and the Inverted Pendulum Swing-up task. We demonstrate that pruning may negatively affect the optimization and that repeated sampling does not help in finding hyperparameter settings that perform better across random seeds. From our experiments we conclude that Bayesian optimization with a noise robust acquisition function is the best choice for hyperparameter optimization in reinforcement learning tasks.


Moody Learners -- Explaining Competitive Behaviour of Reinforcement Learning Agents

arXiv.org Artificial Intelligence

Designing the decision-making processes of artificial agents that are involved in competitive interactions is a challenging task. In a competitive scenario, the agent does not only have a dynamic environment but also is directly affected by the opponents' actions. Observing the Q-values of the agent is usually a way of explaining its behavior, however, do not show the temporal-relation between the selected actions. We address this problem by proposing the \emph{Moody framework}. We evaluate our model by performing a series of experiments using the competitive multiplayer Chef's Hat card game and discuss how our model allows the agents' to obtain a holistic representation of the competitive dynamics within the game.