Undirected Networks
Towards Multi-Agent Reinforcement Learning using Quantum Boltzmann Machines
Mรผller, Tobias, Roch, Christoph, Schmid, Kyrill, Altmann, Philipp
Reinforcement learning has driven impressive advances in machine learning. Simultaneously, quantum-enhanced machine learning algorithms using quantum annealing underlie heavy developments. Recently, a multi-agent reinforcement learning (MARL) architecture combining both paradigms has been proposed. This novel algorithm, which utilizes Quantum Boltzmann Machines (QBMs) for Q-value approximation has outperformed regular deep reinforcement learning in terms of time-steps needed to converge. However, this algorithm was restricted to single-agent and small 2x2 multi-agent grid domains. In this work, we propose an extension to the original concept in order to solve more challenging problems. Similar to classic DQNs, we add an experience replay buffer and use different networks for approximating the target and policy values. The experimental results show that learning becomes more stable and enables agents to find optimal policies in grid-domains with higher complexity. Additionally, we assess how parameter sharing influences the agents behavior in multi-agent domains. Quantum sampling proves to be a promising method for reinforcement learning tasks, but is currently limited by the QPU size and therefore by the size of the input and Boltzmann machine.
A Reinforcement Learning Benchmark for Autonomous Driving in Intersection Scenarios
Liu, Yuqi, Zhang, Qichao, Zhao, Dongbin
In recent years, control under urban intersection scenarios becomes an emerging research topic. In such scenarios, the autonomous vehicle confronts complicated situations since it must deal with the interaction with social vehicles timely while obeying the traffic rules. Generally, the autonomous vehicle is supposed to avoid collisions while pursuing better efficiency. The existing work fails to provide a framework that emphasizes the integrity of the scenarios while being able to deploy and test reinforcement learning(RL) methods. Specifically, we propose a benchmark for training and testing RL-based autonomous driving agents in complex intersection scenarios, which is called RL-CIS. Then, a set of baselines are deployed consists of various algorithms. The test benchmark and baselines are to provide a fair and comprehensive training and testing platform for the study of RL for autonomous driving in the intersection scenario, advancing the progress of RL-based methods for intersection autonomous driving control. The code of our proposed framework can be found at https://github.com/liuyuqi123/ComplexUrbanScenarios.
Deep Learning: Recurrent Neural Networks in Python
The Recurrent Neural Network (RNN) has been used to obtain state-of-the-art results in sequence modeling. This includes time series analysis, forecasting and natural language processing (NLP). Learn about why RNNs beat old-school machine learning algorithms like Hidden Markov Models. The basics of machine learning and neurons (just a review to get you warmed up!) Neural networks for classification and regression (just a review to get you warmed up!) How to predict stock prices and stock returns with LSTMs in Tensorflow 2 (hint: it's not what you think!) All of the materials required for this course can be downloaded and installed for FREE.
Distributed Mission Planning of Complex Tasks for Heterogeneous Multi-Robot Teams
Ferreira, Barbara Arbanas, Petroviฤ, Tamara, Bogdan, Stjepan
In this paper, we propose a distributed multi-stage optimization method for planning complex missions for heterogeneous multi-robot teams. This class of problems involves tasks that can be executed in different ways and are associated with cross-schedule dependencies that constrain the schedules of the different robots in the system. The proposed approach involves a multi-objective heuristic search of the mission, represented as a hierarchical tree that defines the mission goal. This procedure outputs several favorable ways to fulfill the mission, which directly feed into the next stage of the method. We propose a distributed metaheuristic based on evolutionary computation to allocate tasks and generate schedules for the set of chosen decompositions. The method is evaluated in a simulation setup of an automated greenhouse use case, where we demonstrate the method's ability to adapt the planning strategy depending on the available robots and the given optimization criteria.
Generalization in Text-based Games via Hierarchical Reinforcement Learning
Xu, Yunqiu, Fang, Meng, Chen, Ling, Du, Yali, Zhang, Chengqi
Deep reinforcement learning provides a promising approach for text-based games in studying natural language communication between humans and artificial agents. However, the generalization still remains a big challenge as the agents depend critically on the complexity and variety of training tasks. In this paper, we address this problem by introducing a hierarchical framework built upon the knowledge graph-based RL agent. In the high level, a meta-policy is executed to decompose the whole game into a set of subtasks specified by textual goals, and select one of them based on the KG. Then a sub-policy in the low level is executed to conduct goal-conditioned reinforcement learning. We carry out experiments on games with various difficulty levels and show that the proposed method enjoys favorable generalizability.
Synthesizing Policies That Account For Human Execution Errors Caused By State-Aliasing In Markov Decision Processes
Gopalakrishnan, Sriram, Verma, Mudit, Kambhampati, Subbarao
When humans are given a policy to execute, there can be policy execution errors and deviations in execution if there is uncertainty in identifying a state. So an algorithm that computes a policy for a human to execute ought to consider these effects in its computations. An optimal MDP policy that is poorly executed (because of a human agent) maybe much worse than another policy that is executed with fewer errors. In this paper, we consider the problems of erroneous execution and execution delay when computing policies for a human agent that would act in a setting modeled by a Markov Decision Process. We present a framework to model the likelihood of policy execution errors and likelihood of non-policy actions like inaction (delays) due to state uncertainty. This is followed by a hill climbing algorithm to search for good policies that account for these errors. We then use the best policy found by hill climbing with a branch and bound algorithm to find the optimal policy. We show experimental results in a Gridworld domain and analyze the performance of the two algorithms. We also present human studies that verify if our assumptions on policy execution by humans under state-aliasing are reasonable.
Optimal Path Planning of Autonomous Marine Vehicles in Stochastic Dynamic Ocean Flows using a GPU-Accelerated Algorithm
Chowdhury, Rohit, Subramani, Deepak
Autonomous marine vehicles play an essential role in many ocean science and engineering applications. Planning time and energy optimal paths for these vehicles to navigate in stochastic dynamic ocean environments is essential to reduce operational costs. In some missions, they must also harvest solar, wind, or wave energy (modeled as a stochastic scalar field) and move in optimal paths that minimize net energy consumption. Markov Decision Processes (MDPs) provide a natural framework for sequential decision-making for robotic agents in such environments. However, building a realistic model and solving the modeled MDP becomes computationally expensive in large-scale real-time applications, warranting the need for parallel algorithms and efficient implementation. In the present work, we introduce an efficient end-to-end GPU-accelerated algorithm that (i) builds the MDP model (computing transition probabilities and expected one-step rewards); and (ii) solves the MDP to compute an optimal policy. We develop methodical and algorithmic solutions to overcome the limited global memory of GPUs by (i) using a dynamic reduced-order representation of the ocean flows, (ii) leveraging the sparse nature of the state transition probability matrix, (iii) introducing a neighbouring sub-grid concept and (iv) proving that it is sufficient to use only the stochastic scalar field's mean to compute the expected one-step rewards for missions involving energy harvesting from the environment; thereby saving memory and reducing the computational effort. We demonstrate the algorithm on a simulated stochastic dynamic environment and highlight that it builds the MDP model and computes the optimal policy 600-1000x faster than conventional CPU implementations, making it suitable for real-time use.
Reinforcement Learning for Finite-Horizon Restless Multi-Armed Multi-Action Bandits
Xiong, Guojun, Li, Jian, Singh, Rahul
We study a finite-horizon restless multi-armed bandit problem with multiple actions, dubbed R(MA)^2B. The state of each arm evolves according to a controlled Markov decision process (MDP), and the reward of pulling an arm depends on both the current state of the corresponding MDP and the action taken. The goal is to sequentially choose actions for arms so as to maximize the expected value of the cumulative rewards collected. Since finding the optimal policy is typically intractable, we propose a computationally appealing index policy which we call Occupancy-Measured-Reward Index Policy. Our policy is well-defined even if the underlying MDPs are not indexable. We prove that it is asymptotically optimal when the activation budget and number of arms are scaled up, while keeping their ratio as a constant. For the case when the system parameters are unknown, we develop a learning algorithm. Our learning algorithm uses the principle of optimism in the face of uncertainty and further uses a generative model in order to fully exploit the structure of Occupancy-Measured-Reward Index Policy. We call it the R(MA)^2B-UCB algorithm. As compared with the existing algorithms, R(MA)^2B-UCB performs close to an offline optimum policy, and also achieves a sub-linear regret with a low computational complexity. Experimental results show that R(MA)^2B-UCB outperforms the existing algorithms in both regret and run time.
EM Algorithm
EM (Expectation-Maximisation) Algorithm is the go to algorithm whenever we have to do parameter estimation with hidden variables, such as in hidden Markov Chains. For some reason, it is often poorly explained and students end up confused as to what exactly are we maximising in the E-step and M-steps. Here is my attempt at a (hopefully) clear and step by step explanation on exactly how EM Algorithm works.
Unsupervised Machine Learning Hidden Markov Models in Python
The Hidden Markov Model or HMM is all about learning sequences. A lot of the data that would be very useful for us to model is in sequences. Stock prices are sequences of prices. Language is a sequence of words. Credit scoring involves sequences of borrowing and repaying money, and we can use those sequences to predict whether or not you're going to default.