Reinforcement Learning
Deep Reinforcement Learning with Model Learning and Monte Carlo Tree Search in Minecraft
Deep reinforcement learning has been successfully applied to several visual-input tasks using model-free methods. In this paper, we propose a model-based approach that combines learning a DNN-based transition model with Monte Carlo tree search to solve a block-placing task in Minecraft. Our learned transition model predicts the next frame and the rewards one step ahead given the last four frames of the agent's first-person-view image and the current action. A Monte Carlo tree search algorithm then uses this model to plan the best sequence of actions for the agent to perform. On the proposed task in Minecraft, our model-based approach reaches performance comparable to that of the Deep Q-Network but learns faster and is thus more sample efficient in training.
Keywords: Reinforcement Learning, Model-Based Reinforcement Learning, Deep Learning, Model Learning, Monte Carlo Tree Search
Acknowledgements: I would like to express my sincere gratitude to my supervisor Dr. Stefan Uhlich for his continuous support, patience, and immense knowledge that helped me a lot during this study. My thanks and appreciation also go to my colleague Anna Konobelkina for insightful comments on the paper as well as to Sony Europe Limited for providing the resources for this project.
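As a rough illustration of the planning step in the Minecraft abstract above, the sketch below runs a UCT-style Monte Carlo tree search against a transition model that returns a next state and a one-step reward. Everything here is an assumption made for illustration: `toy_model` is a hand-written 1-D stand-in for the learned DNN, and the names `Node`, `simulate`, and `plan` as well as the hyperparameters are hypothetical, not taken from the paper.

```python
import math
import random


def toy_model(state, action):
    """Hypothetical stand-in for a learned model: returns (next_state, reward).

    Positions 0..4 on a line; position 4 is an absorbing goal that pays 1.0 on entry.
    """
    if state == 4:
        return 4, 0.0
    next_state = min(4, max(0, state + action))
    return next_state, (1.0 if next_state == 4 else 0.0)


class Node:
    def __init__(self, state):
        self.state = state
        self.visits = 0
        self.edges = {}  # action -> [child Node, reward, visit count, total return]


def rollout(model, state, actions, depth, gamma, rng):
    """Discounted return of a uniformly random policy for `depth` model steps."""
    ret, discount = 0.0, 1.0
    for _ in range(depth):
        state, reward = model(state, rng.choice(actions))
        ret += discount * reward
        discount *= gamma
    return ret


def simulate(node, model, actions, depth, gamma, c, rng):
    """One simulation: select with UCB1, expand one edge, roll out, back up."""
    if depth == 0:
        return 0.0
    untried = [a for a in actions if a not in node.edges]
    if untried:
        action = rng.choice(untried)
        next_state, reward = model(node.state, action)
        node.edges[action] = [Node(next_state), reward, 0, 0.0]
        ret = reward + gamma * rollout(model, next_state, actions, depth - 1, gamma, rng)
    else:
        def ucb(a):
            _, _, n, total = node.edges[a]
            return total / n + c * math.sqrt(math.log(node.visits) / n)

        action = max(actions, key=ucb)
        child, reward = node.edges[action][0], node.edges[action][1]
        ret = reward + gamma * simulate(child, model, actions, depth - 1, gamma, c, rng)
    node.visits += 1
    node.edges[action][2] += 1
    node.edges[action][3] += ret
    return ret


def plan(model, state, actions, n_sim=300, depth=5, gamma=0.95, c=1.4, seed=0):
    """Return the most-visited root action after n_sim simulations."""
    rng = random.Random(seed)
    root = Node(state)
    for _ in range(n_sim):
        simulate(root, model, actions, depth, gamma, c, rng)
    return max(actions, key=lambda a: root.edges[a][2])


best_action = plan(toy_model, 3, [-1, 1])
```

In this toy setting the planner starts one step to the left of the goal, so stepping right earns the reward immediately. In the setting the abstract describes, the model call would instead be a forward pass of the learned network on the last four frames and the candidate action.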
Learning to Act in Partially Structured Dynamic Environment
Huang, Chen (University of Southern California) | Liu, Lantao (Indiana University - Bloomington) | Sukhatme, Gaurav (University of Southern California)
We investigate the scenario in which a robot needs to reach a designated goal by taking a sequence of appropriate actions in a non-static environment that is partially structured. One application example is controlling a marine vehicle moving in the ocean. The ocean environment is dynamic, and ocean waves typically cause strong disturbances that can perturb the vehicle's motion. Modeling such a dynamic environment is non-trivial, and integrating such a model into robotic motion control is particularly difficult. Fortunately, ocean currents usually form local patterns (e.g. vortices), and thus the environment is partially structured. Historically observed data can be used to train the robot to interact with the ocean flow disturbances. In this paper we propose a method that applies the deep reinforcement learning framework to learn such partially structured complex disturbances. Our preliminary results show that, by training the robot under artificial and real ocean disturbances, the robot is able to act successfully in complex, spatiotemporally varying environments.
Model-Based Reinforcement Learning under Periodical Observability
Klima, Richard (University of Liverpool) | Tuyls, Karl (University of Liverpool) | Oliehoek, Frans A. (University of Liverpool)
The uncertainty induced by unknown attacker locations is one of the problems in deploying AI methods to security domains. We study a model with partial observability of the attacker location and propose a novel reinforcement learning method that uses partial information about attacker behaviour coming from the system. This method is based on deriving beliefs about underlying states using Bayesian inference. These beliefs are then used in the QMDP algorithm. We design the algorithm specifically for spatial security games, where the defender faces intelligent and adversarial opponents.
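The two ingredients named in the abstract above, a Bayesian belief over hidden states and the QMDP action rule, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' algorithm: the tabular `transition`, `likelihood`, and `q_values` inputs and both function names are hypothetical, and the Q-values are assumed to come from the underlying fully observable MDP, as QMDP requires.

```python
def update_belief(belief, transition, likelihood):
    """One Bayes-filter step over hidden attacker locations.

    belief[s]        : prior probability that the attacker is in state s
    transition[s][t] : assumed probability of moving from state s to state t
    likelihood[t]    : assumed probability of the received observation in state t
    """
    n = len(belief)
    # Prediction: push the prior through the transition model.
    predicted = [sum(belief[s] * transition[s][t] for s in range(n)) for t in range(n)]
    # Correction: weight by the observation likelihood, then normalise.
    unnormalised = [predicted[t] * likelihood[t] for t in range(n)]
    total = sum(unnormalised)
    if total == 0.0:
        raise ValueError("observation has zero probability under the model")
    return [u / total for u in unnormalised]


def qmdp_action(belief, q_values):
    """QMDP rule: choose argmax_a sum_s belief[s] * Q(s, a)."""
    n_actions = len(q_values[0])
    scores = [
        sum(b * q_values[s][a] for s, b in enumerate(belief))
        for a in range(n_actions)
    ]
    return max(range(n_actions), key=scores.__getitem__)


# Two locations, a static attacker, and an observation that favours location 0.
stay = [[1.0, 0.0], [0.0, 1.0]]
belief = update_belief([0.5, 0.5], stay, likelihood=[0.9, 0.1])
# Hypothetical Q-table: defending location a pays off only if the attacker is there.
action = qmdp_action(belief, q_values=[[1.0, 0.0], [0.0, 1.0]])
```

Here the belief shifts towards location 0 after the observation, and the QMDP rule defends that location because it has the highest belief-weighted Q-value.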
Learning Against Non-Stationary Agents with Opponent Modelling and Deep Reinforcement Learning
Everett, Richard (University of Oxford) | Roberts, Stephen (University of Oxford)
Humans, like all animals, both cooperate and compete with each other. Through these interactions we learn to observe, act, and manipulate to maximise our utility function, and continue doing so as others learn with us. This is a decentralised non-stationary learning problem, where to survive and flourish an agent must adapt to the gradual changes of other agents as they learn, as well as capitalise on sudden shifts in their behaviour. To learn in the presence of such non-stationarity, we introduce the Switching Agent Model (SAM) that combines traditional deep reinforcement learning – which typically performs poorly in such settings – with opponent modelling, using uncertainty estimations to robustly switch between multiple policies. We empirically show the success of our approach in a multi-agent continuous-action environment, demonstrating SAM’s ability to identify, track, and adapt to gradual and sudden changes in the behaviour of non-stationary agents.
Multiagent Soft Q-Learning
Wei, Ermo (George Mason University) | Wicke, Drew (George Mason University) | Freelan, David (George Mason University) | Luke, Sean (George Mason University)
Policy gradient methods are often applied to reinforcement learning in continuous multiagent games. These methods perform local search in the joint-action space, and as we show, they are susceptible to a game-theoretic pathology known as relative overgeneralization. To resolve this issue, we propose Multiagent Soft Q-learning, which can be seen as the analogue of applying Q-learning to continuous controls. We compare our method to MADDPG, a state-of-the-art approach, and show that our method achieves better coordination in multiagent cooperative tasks, converging to better local optima in the joint action space.
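For readers unfamiliar with the "soft" part of soft Q-learning, the sketch below computes the standard soft value, alpha * log sum_a exp(Q(a) / alpha), and the corresponding Boltzmann policy over a finite set of actions. The discretised action set, the temperature name `alpha`, and both function names are assumptions made for illustration; the paper itself targets continuous joint actions, where the sum becomes an integral.

```python
import math


def soft_value(q_values, alpha):
    """Soft maximum alpha * log(sum_a exp(Q(a) / alpha)), computed stably."""
    m = max(q_values)
    return m + alpha * math.log(sum(math.exp((q - m) / alpha) for q in q_values))


def soft_policy(q_values, alpha):
    """Boltzmann policy pi(a) = exp((Q(a) - V) / alpha), with V the soft value."""
    v = soft_value(q_values, alpha)
    return [math.exp((q - v) / alpha) for q in q_values]


q = [0.0, 1.0, 0.5]
v_warm = soft_value(q, alpha=1.0)    # smooth: every action contributes
v_cold = soft_value(q, alpha=0.01)   # close to the hard maximum of q
pi = soft_policy(q, alpha=1.0)       # stochastic policy that sums to 1
```

A larger temperature keeps probability mass on several actions at once, while a small temperature recovers the greedy maximum of ordinary Q-learning.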
Bayesian Q-learning with Assumed Density Filtering
Jeong, Heejin (University of Pennsylvania) | Lee, Daniel D. (University of Pennsylvania)
While off-policy temporal difference (TD) methods have been broadly used in reinforcement learning due to their efficiency and simple implementation, their Bayesian counterparts have been relatively understudied. This is mainly because the max operator in the Bellman optimality equation introduces non-linearity and inconsistent distributions over the value function. In this paper, we introduce a new Bayesian approach to off-policy TD methods using Assumed Density Filtering, called ADFQ, which updates beliefs on action-values (Q) through an online Bayesian inference method. Uncertainty measures in the beliefs are not only used in exploration but also provide a natural regularization in the belief updates. We also present a connection between ADFQ and Q-learning. Our empirical results show that the proposed ADFQ algorithms outperform the compared algorithms in several task domains. Moreover, our algorithms address general drawbacks of Bayesian reinforcement learning (BRL) such as efficiency, usage of uncertainty, and nonlinearity.
A Survey on Application of Machine Learning Techniques in Optical Networks
Musumeci, Francesco, Rottondi, Cristina, Nag, Avishek, Macaluso, Irene, Zibar, Darko, Ruffini, Marco, Tornatore, Massimo
Today, the amount of data that can be retrieved from communications networks is extremely high and diverse (e.g., data regarding users' behavior, traffic traces, network alarms, signal quality indicators, etc.). Advanced mathematical tools are required to extract useful information from this large set of network data. In particular, Machine Learning (ML) is regarded as a promising methodological area to perform network-data analysis and enable, e.g., automated network self-configuration and fault management. In this survey we classify and describe relevant studies dealing with the applications of ML to optical communications and networking. Optical networks and systems are facing unprecedented growth in complexity due to the introduction of a huge number of adjustable parameters (such as routing configurations, modulation format, symbol rate, coding schemes, etc.), mainly due to the adoption of, among others, coherent transmission/reception technology and advanced digital signal processing, and to the presence of nonlinear effects in optical fiber systems. Although a good number of research papers have appeared in recent years, the application of ML to optical networks is still in its early stage. In this survey we provide an introductory reference for researchers and practitioners interested in this field. To stimulate further work in this area, we conclude the paper by proposing possible new research directions.
Hierarchical Approaches for Reinforcement Learning in Parameterized Action Space
Wei, Ermo (George Mason University) | Wicke, Drew (George Mason University) | Luke, Sean (George Mason University)
We explore Deep Reinforcement Learning in a parameterized action space. Specifically, we investigate how to achieve sample-efficient end-to-end training in these tasks. We propose a new compact architecture for these tasks in which the parameter policy is conditioned on the output of the discrete action policy. We also propose two new methods based on the state-of-the-art algorithms Trust Region Policy Optimization (TRPO) and Stochastic Value Gradient (SVG) to train such an architecture. We demonstrate that these methods outperform the state-of-the-art method, Parameterized Action DDPG, on test domains.
Inverse Reinforcement Learning via Nonparametric Subgoal Modeling
Šošić, Adrian (Technische Universität Darmstadt) | Zoubir, Abdelhak M. (Technische Universität Darmstadt) | Koeppl, Heinz (Technische Universität Darmstadt)
Recent advances in the field of inverse reinforcement learning (IRL) have yielded sophisticated frameworks which relax the original modeling assumption that the behavior of an observed agent reflects only a single intention. Instead, the demonstration data is separated into parts to account for the fact that different trajectories may correspond to different intentions, e.g., because they were generated by different domain experts. In this work, we go one step further: using the intuitive concept of subgoals, we build upon the premise that even a single trajectory can be explained more efficiently locally within a certain context than globally, enabling a more compact representation of the observed behavior. Based on this assumption, we build an implicit intentional model of the agent's goals to forecast its behavior in unobserved situations. The result is an integrated Bayesian prediction framework which provides spatially smooth policy estimates that are consistent with the expert's plan and significantly outperform existing IRL solutions. In addition, the framework can be naturally extended to handle scenarios with time-varying expert intentions.