Agents
An Optimal Procedure to Check Pareto-Optimality in House Markets with Single-Peaked Preferences
Beynier, Aurélie, Maudet, Nicolas, Rey, Simon, Shams, Parham
Recently, the problem of allocating one resource per agent with initial endowments (house markets) has seen a renewed interest: indeed, while in the domain of strict preferences the Top Trading Cycle algorithm is known to be the only procedure guaranteeing Pareto-optimality, individual rationality, and strategy proofness. However, the situation differs in the single-peaked domain. Indeed, Bade presented the Crawler, an alternative procedure enjoying the same properties, with the additional advantage of being implementable in obviously dominant strategies. In this paper we further investigate the Crawler and propose the Diver, a variant which checks optimally whether an allocation is Pareto-optimal for single-peaked preferences, thus improving over known techniques used for checking Pareto-optimality in more general domains. We also prove that the Diver is asymptotically optimal in terms of communication complexity.
Jelly Bean World: A Testbed for Never-Ending Learning
Platanios, Emmanouil Antonios, Saparov, Abulhair, Mitchell, Tom
Machine learning has shown growing success in recent years. However, current machine learning systems are highly specialized, trained for particular problems or domains, and typically on a single narrow dataset. Human learning, on the other hand, is highly general and adaptable. Never-ending learning is a machine learning paradigm that aims to bridge this gap, with the goal of encouraging researchers to design machine learning systems that can learn to perform a wider variety of inter-related tasks in more complex environments. To date, there is no environment or testbed to facilitate the development and evaluation of never-ending learning systems. To this end, we propose the Jelly Bean World testbed. The Jelly Bean World allows experimentation over two-dimensional grid worlds which are filled with items and in which agents can navigate. This testbed provides environments that are sufficiently complex and where more generally intelligent algorithms ought to perform better than current state-of-the-art reinforcement learning approaches. It does so by producing non-stationary environments and facilitating experimentation with multi-task, multi-agent, multi-modal, and curriculum learning settings. We hope that this new freely-available software will prompt new research and interest in the development and evaluation of never-ending learning systems and more broadly, general intelligence systems.
On State Variables, Bandit Problems and POMDPs
State variables are easily the most subtle dimension of sequential decision problems. This is especially true in the context of active learning problems (bandit problems") where decisions affect what we observe and learn. We describe our canonical framework that models {\it any} sequential decision problem, and present our definition of state variables that allows us to claim: Any properly modeled sequential decision problem is Markovian. We then present a novel two-agent perspective of partially observable Markov decision problems (POMDPs) that allows us to then claim: Any model of a real decision problem is (possibly) non-Markovian. We illustrate these perspectives using the context of observing and treating flu in a population, and provide examples of all four classes of policies in this setting. We close with an indication of how to extend this thinking to multiagent problems.
Fast Complete Algorithm for Multiplayer Nash Equilibrium
Nash equilibrium is the central solution concept in game theory. While a Nash equilibrium can be computed in polynomial time for two-player zero-sum games, it is PPAD-hard for two-player general-sum and multiplayer games and widely believed that no efficient algorithms exist [7, 8, 9]. Furthermore, even if we were able to compute an equilibrium for these game classes, it would have no performance guarantee.
Extended Markov Games to Learn Multiple Tasks in Multi-Agent Reinforcement Learning
León, Borja G., Belardinelli, Francesco
This paper focus on formally extending Markov Learning (RL) has recently attracted interest as a way for singleagent Games (MGs), the mathematical model that is traditionally used in RL to learn multiple-task specifications. In this paper we extend MARL, to build a new general model, i.e, not focused solely in one this convergence to multi-agent settings and formally define Extended kind of multi-agent game, that allows multiple learning agents to Markov Games as a general mathematical model that allows concurrently fulfill various non-Markovian specifications in multiagent multiple RL agents to concurrently learn various non-Markovian settings. To support our model with empirical evidence, we specifications. To introduce this new model we provide formal definitions also extended two logic-based RL algorithms to multi-agents systems and proofs as well as empirical tests of RL algorithms running in order to show how various learning agents can fulfill different on this framework. Specifically, we use our model to train two different types of non-Markovian specifications expressed in co-safe- Lineartime logic-based multi-agent RL algorithms to solve diverse settings Temporal Logic (LT L). Our results are promising and point to of non-Markovian co-safe LT L specifications.
On the Sensory Commutativity of Action Sequences for Embodied Agents
Caselles-Dupré, Hugo, Garcia-Ortiz, Michael, Filliat, David
We study perception in the scenario of an embodied agent equipped with first-person sensors and a continuous motor space with multiple degrees of freedom. Inspired by two theories of perception in artificial agents (Higgins et al., 2018; Poincaré, 1895) we consider theoretically the commutation properties of action sequences with respect to sensory information perceived by such embodied agent. From the theoretical derivations, we define the Sensory Commutativity Probability criterion which measures how much an agent's degree of freedom affects the environment in embodied scenarios. We empirically illustrate how it can be used to improve sample-efficiency in Reinforcement Learning.
The Big Three: A Methodology to Increase Data Science ROI by Answering the Questions Companies Care About
Companies may be achieving only a third of the value they could be getting from data science in industry applications. In this paper, we propose a methodology for categorizing and answering 'The Big Three' questions (what is going on, what is causing it, and what actions can I take that will optimize what I care about) using data science. The applications of data science seem to be nearly endless in today's modern landscape, with each company jockeying for position in the new data and insights economy. Yet, data scientists seem to be solely focused on using classification, regression, and clustering methods to answer the question 'what is going on'. Answering questions about why things are happening or how to take optimal actions to improve metrics are relegated to niche fields of research and generally neglected in industry data science analysis. We survey technical methods to answer these other important questions, describe areas in which some of these methods are being applied, and provide a practical example of how to apply our methodology and selected methods to a real business use case.
Cooperative Observation of Targets moving over a Planar Graph with Prediction of Positions
Maia, José E. B., Figueredo, Levi P.
Consider a team with two types of agents: targets and observers. Observers are aerial UAVs that observe targets moving on land with their movements restricted to the paths that form a planar graph on the surface. Observers have limited range of vision and targets do not avoid observers. The objective is to maximize the integral of the number of targets observed in the observation interval. Taking advantage of the fact that the future positions of targets in the short term are predictable, we show in this article a modified hill climbing algorithm that surpasses its previous versions in this new setting of the CTO problem.
Connectivity-driven Communication in Multi-agent Reinforcement Learning through Diffusion Processes on Graphs
Pesce, Emanuele, Montana, Giovanni
We discuss the problem of learning collaborative behaviour in multi-agent systems using deep reinforcement learning (DRL). A connectivity-driven communication (CDC) algorithm is proposed to address three key aspects: what agents to involve in the communication, what information content to share, and how often to share it. We introduce the notion of a connectivity network, modelled as a weighted graph, where nodes represent agents and edges represent the degree of connectivity between pairs of agents. The optimal graph topology is learned end-to-end concurrently with the stochastic policy so as to maximise future expected returns. The communication patterns depend on the graph's topology through a diffusion process on the graph, the heat kernel, which is found by exponentiating the Laplacian eigensystem through time and is fully differentiable. Empirical results show that CDC is capable of superior performance over alternative algorithms for a range of cooperative navigation tasks.
Intrinsic Motivation for Encouraging Synergistic Behavior
Chitnis, Rohan, Tulsiani, Shubham, Gupta, Saurabh, Gupta, Abhinav
We study the role of intrinsic motivation as an exploration bias for reinforcement learning in sparse-reward synergistic tasks, which are tasks where multiple agents must work together to achieve a goal they could not individually. Our key idea is that a good guiding principle for intrinsic motivation in synergistic tasks is to take actions which affect the world in ways that would not be achieved if the agents were acting on their own. Thus, we propose to incentivize agents to take (joint) actions whose effects cannot be predicted via a composition of the predicted effect for each individual agent. We study two instantiations of this idea, one based on the true states encountered, and another based on a dynamics model trained concurrently with the policy. While the former is simpler, the latter has the benefit of being analytically differentiable with respect to the action taken. We validate our approach in robotic bimanual manipulation and multi-agent locomotion tasks with sparse rewards; we find that our approach yields more efficient learning than both 1) training with only the sparse reward and 2) using the typical surprise-based formulation of intrinsic motivation, which does not bias toward synergistic behavior.