Undirected Networks
Exploiting Anonymity in Approximate Linear Programming: Scaling to Large Multiagent MDPs
Robbel, Philipp (Massachusetts Institute of Technology) | Oliehoek, Frans A. (University of Amsterdam and University of Liverpool) | Kochenderfer, Mykel J. (Stanford University)
Many solution methods for Markov Decision Processes (MDPs) exploit structure in the problem and are based on value function factorization. Especially multiagent settings, however, are known to suffer from an exponential increase in value component sizes as interactions become denser, restricting problem sizes and types that can be handled. We present an approach to mitigate this limitation for certain types of multiagent systems, exploiting a property that can be thought of as "anonymous influence" in the factored MDP. We show how representational benefits from anonymity translate into computational efficiencies, both for variable elimination in a factor graph and for the approximate linear programming solution to factored MDPs. Our methods scale to factored MDPs that were previously unsolvable, such as the control of a stochastic disease process over densely connected graphs with 50 nodes and 25 agents.
Bayesian Learning of Other Agents' Finite Controllers for Interactive POMDPs
Panella, Alessandro (University of Illinois at Chicago) | Gmytrasiewicz, Piotr (University of Illinois at Chicago)
We consider an autonomous agent operating in a stochastic, partially-observable, multiagent environment, that explicitly models the other agents as probabilistic deterministic finite-state controllers (PDFCs) in order to predict their actions. We assume that such models are not given to the agent, but instead must be learned from (possibly imperfect) observations of the other agents' behavior. The agent maintains a belief over the other agents' models, that is updated via Bayesian inference. To represent this belief we place a flexible stick-breaking distribution over PDFCs, that allows the posterior to concentrate around controllers whose size is not bounded and scales with the complexity of the observed data. Since this Bayesian inference task is not analytically tractable, we devise a Markov chain Monte Carlo algorithm to approximate the posterior distribution. The agent then embeds the result of this inference into its own decision making process using the interactive POMDP framework. We show that our learning algorithm can learn agent models that are behaviorally accurate for problems of varying complexity, and that the agent's performance increases as a result.
Learning for Decentralized Control of Multiagent Systems in Large, Partially-Observable Stochastic Environments
Liu, Miao (Massachusetts Institute of Technology) | Amato, Christopher (University of New Hampshire) | Anesta, Emily P. (Massachusetts Institute of Technology) | Griffith, John Daniel (Massachusetts Institute of Technology) | How, Jonathan P (Massachusetts Institute of Technology)
Decentralized partially observable Markov decision processes (Dec-POMDPs) provide a general framework for multiagent sequential decision-making under uncertainty. Although Dec-POMDPs are typically intractable to solve for real-world problems, recent research on macro-actions (i.e., temporally-extended actions) has significantly increased the size of problems that can be solved. However, current methods assume the underlying Dec-POMDP model is known a priori or a full simulator is available during planning time. To accommodate more realistic scenarios, when such information is not available, this paper presents a policy-based reinforcement learning approach, which learns the agent policies based solely on trajectories generated by previous interaction with the environment (e.g., demonstrations). We show that our approach is able to generate valid macro-action controllers and develop an expectationmaximization (EM) algorithm (called Policy-based EM or PoEM), which has convergence guarantees for batch learning. Our experiments show PoEM is a scalable learning method that can learn optimal policies and improve upon hand-coded “expert” solutions.
Target Surveillance in Adversarial Environments Using POMDPs
Egorov, Maxim (Stanford University) | Kochenderfer, Mykel J. (Stanford University) | Uudmae, Jaak J. (Stanford University)
This paper introduces an extension of the target surveillance problem in which the surveillance agent is exposed to an adversarial ballistic threat. The problem is formulated as a mixed observability Markov decision process (MOMDP), which is a factored variant of the partially observable Markov decision process, to account for state and dynamic uncertainties. The control policy resulting from solving the MOMDP aims to optimize the frequency of target observations and minimize exposure to the ballistic threat. The adversary’s behavior is modeled with a level-k policy, which is used to construct the state transition of the MOMDP. The approach is empirically evaluated against a MOMDP adversary and against a human opponent in a target surveillance computer game. The empirical results demonstrate that, on average, level 3 MOMDP policies outperform lower level reasoning policies as well as human players.
Detection of Plan Deviation in Multi-Agent Systems
Banerjee, Bikramjit (University of Southern Mississippi) | Loscalzo, Steven (Air Force Research Lab) | Thompson, Daniel Lucas (University of Southern Mississippi)
Plan monitoring in a collaborative multi-agent system requires an agent to not only monitor the execution of its own plan, but also to detect possible deviations or failures in the plan execution of its teammates. In domains featuring partial observability and uncertainty in the agents’ sensing and actuation, especially where communication among agents is sparse (as a part of a cost-minimized plan), plan monitoring can be a significant challenge. We design an Expectation Maximization (EM) based algorithm for detection of plan deviation of teammates in such a multi-agent system. However, a direct implementation of this algorithm is intractable, so we also design an alternative approach grounded on the agents’ plans, for tractability. We establish its equivalence to the intractable version, and evaluate these techniques in some challenging tasks.
Reinforcement Learning with Parameterized Actions
Masson, Warwick (University of the Witwatersrand) | Ranchod, Pravesh (University of the Witwatersrand ) | Konidaris, George (Duke University)
We introduce a model-free algorithm for learning in Markov decision processes with parameterized actions—discrete actions with continuous parameters. At each step the agent must select both which action to use and which parameters to use with that action. We introduce the Q-PAMDP algorithm for learning in these domains, show that it converges to a local optimum, and compare it to direct policy search in the goal-scoring and Platform domains.
Offline Evaluation of Online Reinforcement Learning Algorithms
Mandel, Travis (University of Washington) | Liu, Yun-En (Enlearn) | Brunskill, Emma (Carnegie Mellon University) | Popović, Zoran (University of Washington)
In many real-world reinforcement learning problems, we have access to an existing dataset and would like to use it to evaluate various learning approaches. Typically, one would prefer not to deploy a fixed policy, but rather an algorithm that learns to improve its behavior as it gains more experience. Therefore, we seek to evaluate how a proposed algorithm learns in our environment, meaning we need to evaluate how an algorithm would have gathered experience if it were run online. In this work, we develop three new evaluation approaches which guarantee that, given some history, algorithms are fed samples from the distribution that they would have encountered if they were run online. Additionally, we are the first to propose an approach that is provably unbiased given finite data, eliminating bias due to the length of the evaluation. Finally, we compare the sample-efficiency of these approaches on multiple datasets, including one from a real-world deployment of an educational game.
Interaction Point Processes via Infinite Branching Model
Lin, Peng (NICTA and the University of New South Wales) | Zhang, Bang (NICTA ) | Guo, Ting (NICTA) | Wang, Yang (NICTA) | Chen, Fang (NICTA)
Many natural and social phenomena can be modeled by interaction point processes (IPPs) (Diggle et al. 1994), stochastic point processes considering the interaction between points. In this paper, we propose the infinite branching model (IBM), a Bayesian statistical model that can generalize and extend some popular IPPs, e.g., Hawkes process (Hawkes 1971; Hawkes and Oakes 1974). It treats IPP as a mixture of basis point processes with the aid of a distance dependent prior over branching structure that describes the relationship between points. The IBM can estimate point event intensity, interaction mechanism and branching structure simultaneously. A generic Metropolis-within-Gibbs sampling method is also developed for model parameter inference. The experiments on synthetic and real-world data demonstrate the superiority of the IBM.
Compressed Conditional Mean Embeddings for Model-Based Reinforcement Learning
Lever, Guy (University College London) | Shawe-Taylor, John (University College London) | Stafford, Ronnie (University College London) | Szepesvari, Csaba (University of Alberta)
We present a model-based approach to solving Markov decision processes (MDPs) in which the system dynamics are learned using conditional mean embeddings (CMEs). This class of methods comes with strong performance guarantees, and enables planning to be performed in an induced finite (pseudo-)MDP, which approximates the MDP, but can be solved exactly using dynamic programming. Two drawbacks of existing methods exist: firstly, the size of the induced finite (pseudo-)MDP scales quadratically with the amount of data used to learn the model, costing much memory and time when planning with the learned model; secondly, learning the CME itself using powerful kernel least-squares is costly – a second computational bottleneck. We present an algorithm which maintains a rich kernelized CME model class, but solves both problems: firstly we demonstrate that the loss function for the CME model suggests a principled approach to compressing the induced (pseudo-)MDP, leading to faster planning, while maintaining guarantees; secondly we propose to learn the CME model itself using fast sparse-greedy kernel regression well-suited to the RL context. We demonstrate superior performance to existing methods in this class of modelbased approaches on a range of MDPs.
Improving Predictive State Representations via Gradient Descent
Jiang, Nan (University of Michigan) | Kulesza, Alex (University of Michigan) | Singh, Satinder (University of Michigan)
Predictive state representations (PSRs) model dynamical systems using appropriately chosen predictions about future observations as a representation of the current state. In contrast to the hidden states posited by HMMs or RNNs, PSR states are directly observable in the training data; this gives rise to a moment-matching spectral algorithm for learning PSRs that is computationally efficient and statistically consistent when the model complexity matches that of the true system generating the data. In practice, however, model mismatch is inevitable and while spectral learning remains appealingly fast and simple it may fail to find optimal models. To address this problem, we investigate the use of gradient methods for improving spectrally-learned PSRs. We show that only a small amount of additional gradient optimization can lead to significant performance gains, and moreover that initializing gradient methods with the spectral learning solution yields better models in significantly less time than starting from scratch.