

Efficient Algorithms for Mitigating Uncertainty and Risk in Reinforcement Learning

Su, Xihong

arXiv.org Artificial Intelligence

This dissertation makes three main contributions. First, we identify a new connection between policy gradient and dynamic programming in multi-model MDPs (MMDPs) and propose the Coordinate Ascent Dynamic Programming (CADP) algorithm, which computes a Markov policy maximizing the discounted return averaged over the uncertain models. CADP adjusts model weights iteratively to guarantee monotone policy improvements to a local maximum. Second, we establish necessary and sufficient conditions for the exponential ERM Bellman operator to be a contraction and prove the existence of stationary deterministic optimal policies for the entropic risk measure (ERM) and entropic value-at-risk (EVaR) under the total reward criterion (ERM-TRC and EVaR-TRC). We also propose exponential value iteration, policy iteration, and linear programming algorithms for computing optimal stationary policies under these objectives. Third, we propose model-free Q-learning algorithms for computing policies with the risk-averse objectives ERM-TRC and EVaR-TRC. The challenge is that the ERM Bellman operator underlying Q-learning may not be a contraction. Instead, we use the monotonicity of the Q-learning ERM Bellman operators to derive a rigorous proof that the ERM-TRC and EVaR-TRC Q-learning algorithms converge to the optimal risk-averse value functions. The proposed Q-learning algorithms compute optimal stationary policies for ERM-TRC and EVaR-TRC.
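
As a rough illustration of exponential value iteration for the ERM total-reward criterion, here is a numpy sketch applying the exponential ERM Bellman operator (the array layout and function name are ours, not the dissertation's; convergence requires assumptions on the MDP, e.g. a transient chain):

```python
import numpy as np

def erm_value_iteration(P, r, beta, n_iters=1000):
    """Sketch of exponential (ERM) value iteration, assuming P[a, s, s'] are
    transition probabilities and r[a, s] rewards, with risk level beta > 0.
    Applies the exponential ERM Bellman operator
        (T v)(s) = max_a [ r(s, a) - (1/beta) log sum_s' P(s'|s, a) exp(-beta v(s')) ].
    """
    v = np.zeros(P.shape[-1])
    for _ in range(n_iters):
        # ERM certainty equivalent of the next-state value, shape (A, S)
        ce = -np.log(P @ np.exp(-beta * v)) / beta
        v = np.max(r + ce, axis=0)
    pi = np.argmax(r + (-np.log(P @ np.exp(-beta * v)) / beta), axis=0)
    return v, pi
```

With beta → 0 the certainty equivalent recovers the risk-neutral expectation, so this sketch degenerates to ordinary value iteration in that limit.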


Solving Multi-Model MDPs by Coordinate Ascent and Dynamic Programming

Su, Xihong, Petrik, Marek

arXiv.org Artificial Intelligence

The multi-model Markov decision process (MMDP) is a promising framework for computing policies that are robust to parameter uncertainty in MDPs. MMDPs aim to find a policy that maximizes the expected return over a distribution of MDP models. Because MMDPs are NP-hard to solve, most methods resort to approximations. In this paper, we derive the policy gradient of MMDPs and propose CADP, which combines a coordinate ascent method with a dynamic programming algorithm for solving MMDPs. The main innovation of CADP over earlier algorithms is its coordinate ascent perspective: adjusting model weights iteratively to guarantee monotone policy improvements to a local maximum. A theoretical analysis of CADP proves that it never performs worse than previous dynamic programming algorithms such as WSU. Our numerical results indicate that CADP substantially outperforms existing methods on several benchmark problems.
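
A hedged sketch of the coordinate-ascent view: alternate a weighted backward-induction sweep with an occupancy-based weight update. This is a simplified reading of the idea, not the authors' exact CADP update, and the array layouts are assumptions:

```python
import numpy as np

def weighted_backward_induction(P, r, lam, H):
    """One DP sweep: a Markov policy greedy w.r.t. lam-weighted Q-values,
    where lam[m, t, s] weights model m at time t in state s.
    Assumed layout: P[m, s, a, s'] transitions, r[m, s, a] rewards."""
    M, S, A, _ = P.shape
    v = np.zeros((M, S))                 # per-model value-to-go
    pi = np.zeros((H, S), dtype=int)
    for t in reversed(range(H)):
        q = r + np.einsum('msat,mt->msa', P, v)              # (M, S, A)
        pi[t] = np.argmax(np.einsum('ms,msa->sa', lam[:, t], q), axis=1)
        v = q[:, np.arange(S), pi[t]]                        # value of pi per model
    return pi, v

def occupancy_weights(P, p0, w, pi, H):
    """Coordinate-ascent weight update (simplified reading): weight model m at
    (t, s) by its prior w[m] times the probability that model m visits s at
    time t under the current policy pi, normalized over models."""
    M, S, _, _ = P.shape
    d = np.tile(p0, (M, 1))                                  # occupancy at t = 0
    lam = np.zeros((M, H, S))
    for t in range(H):
        lam[:, t] = w[:, None] * d
        Ppi = P[:, np.arange(S), pi[t]]                      # (M, S, S')
        d = np.einsum('ms,mst->mt', d, Ppi)
    return lam / np.maximum(lam.sum(axis=0, keepdims=True), 1e-12)

def cadp_sketch(P, r, p0, w, H, iters=20):
    """Alternate DP sweeps and weight updates; returns the last policy and
    its weighted return under initial distribution p0."""
    M, S, _, _ = P.shape
    lam = np.broadcast_to(w[:, None, None], (M, H, S)).copy()
    lam /= lam.sum(axis=0, keepdims=True)
    for _ in range(iters):
        pi, v = weighted_backward_induction(P, r, lam, H)
        lam = occupancy_weights(P, p0, w, pi, H)
    return pi, w @ (v @ p0)
```

States a model never visits receive zero weight and so get arbitrary actions there, which is harmless for the weighted objective.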


Inverse Reinforcement Learning without Reinforcement Learning

Swamy, Gokul, Choudhury, Sanjiban, Bagnell, J. Andrew, Wu, Zhiwei Steven

arXiv.org Artificial Intelligence

Inverse Reinforcement Learning (IRL) is a powerful set of techniques for imitation learning that aims to learn a reward function that rationalizes expert demonstrations. Unfortunately, traditional IRL methods suffer from a computational weakness: they require repeatedly solving a hard reinforcement learning (RL) problem as a subroutine. This is counter-intuitive from the viewpoint of reductions: we have reduced the easier problem of imitation learning to repeatedly solving the harder problem of RL. Another thread of work has proved that access to the side-information of the distribution of states where a strong policy spends time can dramatically reduce the sample and computational complexities of solving an RL problem. In this work, we demonstrate for the first time a more informed imitation learning reduction where we utilize the state distribution of the expert to alleviate the global exploration component of the RL subroutine, providing an exponential speedup in theory. In practice, we find that we are able to significantly speed up the prior art on continuous control tasks.
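
The exploration speedup rests on running the RL subroutine from states drawn from the expert's visitation distribution rather than the environment's own initial states. A minimal Python sketch of that idea, assuming a hypothetical `set_state` hook on the simulator (not part of any real environment API):

```python
import random

class ExpertResetWrapper:
    """Hedged sketch: reset the learner to states sampled from the expert's
    state distribution, sidestepping global exploration. `env.set_state`
    is an assumed simulator hook, not a standard Gym-style method."""

    def __init__(self, env, expert_states):
        self.env = env
        self.expert_states = expert_states  # states visited in expert demos

    def reset(self):
        s = random.choice(self.expert_states)
        self.env.set_state(s)               # assumed hypothetical hook
        return s
```

Any RL algorithm driven through this wrapper then only has to improve locally around the expert's distribution, which is the source of the claimed exponential speedup in theory.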


Decentralised Q-Learning for Multi-Agent Markov Decision Processes with a Satisfiability Criterion

Keval, Keshav P., Borkar, Vivek S.

arXiv.org Artificial Intelligence

In this paper, we propose a reinforcement learning algorithm to solve a multi-agent Markov decision process (MMDP). The goal, inspired by Blackwell's Approachability Theorem, is to lower the time-average cost of each agent below a pre-specified agent-specific bound. For the MMDP, we assume that the state dynamics are controlled by the joint actions of the agents, but that the per-stage costs depend only on the individual agent's actions. We apply Q-learning to a weighted combination of the agents' costs, obtained by a gossip algorithm, and use the Metropolis-Hastings or Multiplicative Weights formalisms to modulate the gossip averaging matrix. Our algorithm operates on multiple timescales, and we prove that under mild conditions it approximately achieves the desired bounds for each agent. We also demonstrate the empirical performance of the algorithm in the more general setting of MMDPs with jointly controlled per-stage costs.
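
The Metropolis-Hastings rule for building a gossip averaging matrix from a communication graph is standard; a small numpy sketch (the function name and adjacency-matrix layout are our own):

```python
import numpy as np

def metropolis_hastings_weights(adj):
    """Build a symmetric, doubly stochastic gossip averaging matrix from an
    undirected communication graph via the Metropolis-Hastings rule:
        W[i, j] = 1 / (1 + max(deg_i, deg_j))   for edges (i, j),
        W[i, i] = 1 - sum_{j != i} W[i, j].
    `adj` is a symmetric 0/1 adjacency matrix with zero diagonal."""
    n = adj.shape[0]
    deg = adj.sum(axis=1)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if adj[i, j]:
                W[i, j] = 1.0 / (1.0 + max(deg[i], deg[j]))
        W[i, i] = 1.0 - W[i].sum()              # self-weight absorbs the rest
    return W
```

Because the resulting matrix is doubly stochastic, repeated gossip steps drive the agents' local estimates toward the network-wide average, which is what the cost-averaging step in the algorithm relies on.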


Explainable Multi-Agent Reinforcement Learning for Temporal Queries

Boggess, Kayla, Kraus, Sarit, Feng, Lu

arXiv.org Artificial Intelligence

As multi-agent reinforcement learning (MARL) systems are increasingly deployed throughout society, it is imperative yet challenging for users to understand the emergent behaviors of MARL agents in complex environments. This work presents an approach for generating policy-level contrastive explanations for MARL to answer a temporal user query, which specifies a sequence of tasks completed by agents with possible cooperation. The proposed approach encodes the temporal query as a Probabilistic Computation Tree Logic (PCTL) formula and checks whether the query is feasible under a given MARL policy via probabilistic model checking. Such explanations can help reconcile discrepancies between the actual and anticipated multi-agent behaviors. The proposed approach also generates correct and complete explanations to pinpoint reasons that make a user query infeasible. We have successfully applied the proposed approach to four benchmark MARL domains (up to 9 agents in one domain). Moreover, the results of a user study show that the generated explanations significantly improve user performance and satisfaction.


Accelerating Representation Learning with View-Consistent Dynamics in Data-Efficient Reinforcement Learning

Huang, Tao, Wang, Jiachen, Chen, Xiao

arXiv.org Artificial Intelligence

Learning informative representations from image-based observations is of fundamental concern in deep reinforcement learning (RL). However, data inefficiency remains a significant barrier to this objective. To overcome this obstacle, we propose to accelerate state representation learning by enforcing view consistency on the dynamics. First, we introduce a formalism of the Multi-view Markov Decision Process (MMDP) that incorporates multiple views of the state. Following the structure of the MMDP, our method, View-Consistent Dynamics (VCD), learns state representations by training a view-consistent dynamics model in the latent space, where views are generated by applying data augmentation to states. Empirical evaluation on the DeepMind Control Suite and Atari-100k demonstrates that VCD is the state-of-the-art data-efficient algorithm on visual control tasks.
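
One way to read the view-consistency objective as code: both augmented views of a state should produce agreeing latent next-state predictions that also match the encoded true next observation. A hypothetical numpy sketch (the loss form and the `dynamics` signature are assumptions, not VCD's exact objective):

```python
import numpy as np

def view_consistent_dynamics_loss(z1, z2, z1_next, z2_next, dynamics, a):
    """Hedged sketch of a view-consistency objective. z1, z2 are latent
    encodings of two augmented views of the same state; z1_next, z2_next
    encode two views of the true next state; `dynamics(z, a)` is an assumed
    latent transition model."""
    p1, p2 = dynamics(z1, a), dynamics(z2, a)
    # Views must agree with each other on the predicted next latent...
    pred_consistency = np.mean((p1 - p2) ** 2)
    # ...and each prediction must match the (averaged) target next latent.
    target = 0.5 * (z1_next + z2_next)
    pred_error = 0.5 * (np.mean((p1 - target) ** 2)
                        + np.mean((p2 - target) ** 2))
    return pred_error + pred_consistency
```

In a real training loop the targets would typically be produced by a stop-gradient or momentum encoder; that detail is omitted here.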


Model-based Multi-Agent Reinforcement Learning with Cooperative Prioritized Sweeping

Bargiacchi, Eugenio, Verstraeten, Timothy, Roijers, Diederik M., Nowé, Ann

arXiv.org Artificial Intelligence

We present a new model-based reinforcement learning algorithm, Cooperative Prioritized Sweeping, for efficient learning in multi-agent Markov decision processes. The algorithm allows for sample-efficient learning on large problems by exploiting a factorization to approximate the value function. Our approach only requires knowledge about the structure of the problem in the form of a dynamic decision network. Using this information, our method learns a model of the environment and performs temporal difference updates which affect multiple joint states and actions at once. Batch updates are additionally performed which efficiently back-propagate knowledge throughout the factored Q-function. Our method outperforms the state-of-the-art sparse cooperative Q-learning algorithm, both on the well-known SysAdmin benchmark and on randomized environments.
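
As a rough illustration of the prioritized-sweeping idea the algorithm builds on, here is a hedged single-agent sketch (flat rather than factored, so it omits the paper's multi-agent machinery; the data-structure choices are ours):

```python
import heapq
from collections import defaultdict

def prioritized_sweeping_update(Q, model, predecessors, s, a, r, s2,
                                alpha=0.5, gamma=0.95, theta=1e-3,
                                n_backups=10):
    """After observing transition (s, a, r, s2), back-propagate value changes
    to predecessor state-action pairs in order of TD-error magnitude.
    `model` caches observed transitions; `predecessors[s]` lists (s', a')
    pairs known (from experience) to lead into s."""
    model[(s, a)] = (r, s2)
    td = r + gamma * max(Q[s2].values(), default=0.0) - Q[s][a]
    pq = [(-abs(td), s, a)]                       # max-heap via negated priority
    for _ in range(n_backups):
        if not pq:
            break
        _, s, a = heapq.heappop(pq)
        r, s2 = model[(s, a)]
        Q[s][a] += alpha * (r + gamma * max(Q[s2].values(), default=0.0) - Q[s][a])
        # Re-prioritize predecessors of the state whose value just changed.
        for (sp, ap) in predecessors[s]:
            rp, _ = model[(sp, ap)]
            p = abs(rp + gamma * max(Q[s].values(), default=0.0) - Q[sp][ap])
            if p > theta:
                heapq.heappush(pq, (-p, sp, ap))
    return Q
```

The paper's contribution replaces the flat `Q` here with a factored Q-function over a dynamic decision network, so one backup touches many joint states and actions at once.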


Planning under Uncertainty for Aggregated Electric Vehicle Charging Using Markov Decision Processes

Walraven, Erwin (Delft University of Technology) | Spaan, Matthijs T. J. (Delft University of Technology)

AAAI Conferences

The increasing penetration of renewable energy sources and electric vehicles raises important challenges related to the operation of electricity grids. For instance, the amount of power generated by wind turbines is time-varying and dependent on the weather, which makes it hard to match flexible electric vehicle demand and uncertain wind power supply. In this paper we propose a vehicle aggregation framework which uses Markov Decision Processes to control charging of multiple electric vehicles and deals with uncertainty in renewable supply. We present a grouping technique to address the scalability aspects of our framework. In experiments we show that the aggregation framework maximizes the profit of the aggregator while reducing usage of conventionally-generated power and cost of customers.


Agile Planning for Real-World Disaster Response

Wu, Feng (University of Science and Technology of China) | Ramchurn, Sarvapali D. (University of Southampton) | Jiang, Wenchao (University of Nottingham) | Fischer, Joel E. (University of Nottingham) | Rodden, Tom (University of Nottingham) | Jennings, Nicholas R. (University of Southampton)

AAAI Conferences

We consider a setting where an agent-based planner instructs teams of human emergency responders to perform tasks in the real world. Due to uncertainty in the environment and the inability of the planner to consider all human preferences and all attributes of the real world, humans may reject plans computed by the agent. A naïve solution that replans given a rejection is inefficient and does not guarantee the new plan will be acceptable. Hence, we propose a new model re-planning problem using a Multi-agent Markov Decision Process that …

However, as pointed out by [Moran et al., 2013], such assumptions simply do not hold in reality. The environment is typically prone to significant uncertainties, and humans may reject plans suggested by a software agent if they are tired or prefer to work with specific partners. Now, a naïve solution to this would involve re-planning every time a rejection is received. However, this may instead result in a high computational cost (as a whole new plan needs to be computed for the whole team), may generate a plan that is still not acceptable, and, following multiple rejection/replanning cycles (as all individual team members need to accept the new plan), may lead the teams to suboptimal solutions.


Best-Response Planning of Thermostatically Controlled Loads under Power Constraints

Nijs, Frits de (Delft University of Technology) | Spaan, Matthijs T. J. (Delft University of Technology) | Weerdt, Mathijs M. de (Delft University of Technology)

AAAI Conferences

Renewable power sources such as wind and solar are inflexible in their energy production, which requires demand to rapidly follow supply in order to maintain energy balance. Promising controllable demands are air-conditioners and heat pumps, which use electric energy to maintain a temperature at a setpoint. Such Thermostatically Controlled Loads (TCLs) have been shown to be able to follow a power curve using reactive control. In this paper we investigate the use of planning under uncertainty to pro-actively control an aggregation of TCLs to overcome temporary grid imbalance. We present a formal definition of the planning problem under consideration, which we model using the Multi-Agent Markov Decision Process (MMDP) framework. Since we are dealing with hundreds of agents, solving the resulting MMDPs directly is intractable. Instead, we propose to decompose the problem by decoupling the interactions through arbitrage. Decomposing the problem requires relaxing the joint power-consumption constraint, so naively joining the individual plans together can cause overconsumption. Arbitrage acts as a conflict-resolution mechanism during policy execution, using the future expected value of policies to determine which TCLs should receive the available energy. We experimentally compare several methods to plan with arbitrage, and conclude that a best-response-like mechanism is a scalable approach that returns near-optimal solutions.
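
Arbitrage as described can be sketched as a greedy allocation by expected-value difference; a hypothetical minimal version (the request format and tie-breaking are assumptions, not the paper's exact mechanism):

```python
def arbitrage(requests, capacity):
    """Hedged sketch of arbitrage as conflict resolution: each TCL reports the
    expected future value of its policy with and without power this step, and
    the available capacity goes to the TCLs that lose the most from denial.
    `requests` maps tcl_id -> (value_if_on, value_if_off, load_kw)."""
    ranked = sorted(requests.items(),
                    key=lambda kv: kv[1][0] - kv[1][1],  # value gap if powered
                    reverse=True)
    granted, used = set(), 0.0
    for tcl, (v_on, v_off, load) in ranked:
        if used + load <= capacity and v_on > v_off:
            granted.add(tcl)
            used += load
    return granted
```

The per-TCL values would come from the individually planned policies; greedy allocation by value gap is one simple stand-in for resolving the relaxed joint power constraint at execution time.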