 Agents


Multi-Agent Low-Dimensional Linear Bandits

arXiv.org Machine Learning

We study a multi-agent stochastic linear bandit with side information, parameterized by an unknown vector $\theta^* \in \mathbb{R}^d$. The side information consists of a finite collection of low-dimensional subspaces, one of which contains $\theta^*$. In our setting, agents can collaborate to reduce regret by sending recommendations across a communication graph connecting them. We present a novel decentralized algorithm, where agents communicate subspace indices with each other, and each agent plays a projected variant of LinUCB on the corresponding (low-dimensional) subspace. Through a combination of collaborative best subspace identification, and per-agent learning of an unknown vector in the corresponding low-dimensional subspace, we show that the per-agent regret is much smaller than the case when agents do not communicate. By collaborating to identify the subspace containing $\theta^*$, we show that each agent effectively solves an easier instance of the linear bandit (compared to the case of no collaboration), thus leading to the reduced per-agent regret. We finally complement these results through simulations.


Maximin Share Allocations on Cycles

Journal of Artificial Intelligence Research

The problem of fair division of indivisible goods is a fundamental problem of resource allocation in multi-agent systems, also studied extensively in social choice. Recently, the problem was generalized to the case when goods form a graph and the goal is to allocate goods to agents so that each agent's bundle forms a connected subgraph. For the maximin share fairness criterion, researchers proved that if goods form a tree, an allocation offering each agent a bundle of at least her maximin share value always exists. Moreover, it can be found in polynomial time. In this paper we consider the problem of maximin share allocations of goods on a cycle. Despite the simplicity of the graph, the problem turns out to be significantly harder than its tree version. We present cases when maximin share allocations of goods on cycles exist and provide in this case results on allocations guaranteeing each agent a certain fraction of her maximin share. We also study algorithms for computing maximin share allocations of goods on cycles.
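As a small illustration of the quantity discussed above, the following sketch computes one agent's maximin share when goods sit on a cycle and every bundle must be a contiguous arc. It is a brute-force enumeration for tiny instances, not the algorithms studied in the paper. The function name `maximin_share_on_cycle`, the additive valuation, and the requirement that every bundle be non-empty are assumptions made for this example.

```python
from itertools import combinations


def maximin_share_on_cycle(values, n_agents):
    """Brute-force maximin share of one agent for goods arranged on a cycle.

    values[i] is the agent's (additive) value for good i; goods i and i+1 are
    adjacent, and the last good is adjacent to the first. Assumption: every
    bundle is a non-empty contiguous arc of the cycle.
    """
    m = len(values)
    if n_agents < 1 or n_agents > m:
        raise ValueError("need 1 <= n_agents <= number of goods")
    if n_agents == 1:
        # A single bundle is the whole cycle.
        return sum(values)

    best = None
    # A cut at position c separates good c-1 from good c (cyclically).
    # Choosing n_agents distinct cuts splits the cycle into n_agents arcs.
    for cuts in combinations(range(m), n_agents):
        bundle_values = []
        for i, start in enumerate(cuts):
            end = cuts[(i + 1) % n_agents]
            length = (end - start) % m  # arc runs from start up to end - 1
            bundle_values.append(
                sum(values[(start + k) % m] for k in range(length))
            )
        worst = min(bundle_values)
        if best is None or worst > best:
            best = worst
    return best


if __name__ == "__main__":
    print(maximin_share_on_cycle([1, 2, 3, 4], 2))
```

The enumeration visits every choice of cut positions, so it is only practical for a handful of goods; it is meant to make the definition concrete rather than to be efficient.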


Multitask Bandit Learning through Heterogeneous Feedback Aggregation

arXiv.org Machine Learning

Online multi-armed bandit learning has many important real-world applications (see Villar et al., 2015; Shen et al., 2015; Li et al., 2010, for a few examples). In practice, a group of online bandit learning agents is often deployed for similar tasks, and the agents learn to perform these tasks in similar yet nonidentical environments. For example, a group of assistive healthcare robots may be deployed to provide personalized cognitive training to people with dementia (PwD), e.g., by playing cognitive training games with people (Kubota et al., 2020). Each robot seeks to learn the preferences of its paired PwD so as to recommend tailored health interventions based on how the PwD reacts to and engages with the activities (as captured by sensors on the robots) (Kubota et al., 2020). As PwD may have similar preferences and may therefore exhibit similar reactions, one natural question arises: can the robots, as a multi-agent system, learn to perform their respective tasks faster through collaboration? In this paper, we develop multi-agent bandit learning algorithms where each agent can robustly aggregate data from other agents to better perform its respective task. We generalize the multi-armed bandit problem (Auer et al., 2002) and formulate the $\epsilon$-Multi-Player Multi-Armed Bandit ($\epsilon$-MPMAB) problem, which models heterogeneous multitask learning in a multi-agent bandit learning setting. In an $\epsilon$-MPMAB problem instance, a set of M players are deployed to perform similar tasks: simultaneously, they interact with a set of actions/arms, and for each arm, different players receive feedback from similar but not necessarily identical reward distributions. In the above assistive robotics example, each player corresponds to a robot; each arm corresponds to one of the cognitive activities to choose from; for each player and each arm, there is a separate reward distribution which reflects a PwD's


Exploring Zero-Shot Emergent Communication in Embodied Multi-Agent Populations

arXiv.org Artificial Intelligence

Effective communication is an important skill for enabling information exchange and cooperation in multi-agent settings. Indeed, emergent communication is now a vibrant field of research, with common settings involving discrete cheap-talk channels. One limitation of this setting is that it does not allow for the emergent protocols to generalize beyond the training partners. Furthermore, so far emergent communication has primarily focused on the use of symbolic channels. In this work, we extend this line of work to a new modality, by studying agents that learn to communicate via actuating their joints in a 3D environment. We show that under realistic assumptions, a non-uniform distribution of intents and a common-knowledge energy cost, these agents can find protocols that generalize to novel partners. We also explore and analyze specific difficulties associated with finding these solutions in practice. Finally, we propose and evaluate initial training improvements to address these challenges, involving both specific training curricula and providing the latent feature that can be coordinated on during training.


A Framework for Learning Predator-prey Agents from Simulation to Real World

arXiv.org Artificial Intelligence

In this paper, we propose an evolutionary predator-prey robot system which can be generally implemented from simulation to the real world. We design the closed-loop robot system with camera and infrared sensors as inputs to the controller. Both the predators and prey are co-evolved by NeuroEvolution of Augmenting Topologies (NEAT) to learn the expected behaviours. We design a framework that integrates OpenAI Gym, Robot Operating System (ROS), and Gazebo. In such a framework, users only need to focus on algorithms without worrying about the details of manipulating robots in either simulation or the real world. Combining simulations, real-world evolution, and robustness analysis, it can be applied to develop solutions for predator-prey tasks. For the convenience of users, the source code and videos of the simulated and real-world experiments are published on GitHub.


Low-Variance Policy Gradient Estimation with World Models

arXiv.org Artificial Intelligence

In this paper, we propose World Model Policy Gradient (WMPG), an approach to reduce the variance of policy gradient estimates using learned world models (WMs). In WMPG, a WM is trained online and used to imagine trajectories. The imagined trajectories are used in two ways: first, to calculate a without-replacement estimator of the policy gradient; second, their return is used as an informed baseline. We compare the proposed approach with AC and MAC on a set of environments of increasing complexity (CartPole, LunarLander and Pong) and find that WMPG has better sample efficiency. Based on these results, we conclude that WMPG can yield increased sample efficiency in cases where a robust latent representation of the environment can be learned.
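To make the role of a baseline concrete, here is a minimal sketch of why subtracting a baseline from the return can lower the variance of a score-function policy gradient estimate without changing its mean. It is not WMPG: the one-step, two-action sigmoid policy, the fixed rewards, and the function name `score_function_gradient_stats` are assumptions chosen so that the mean and variance can be computed exactly by enumeration.

```python
import math


def score_function_gradient_stats(theta, rewards, baseline):
    """Exact mean and variance of the one-sample estimator
    (R(a) - baseline) * d/dtheta log pi(a) for a two-action sigmoid policy.

    Assumption: pi(a=1) = sigmoid(theta), and rewards[a] is the deterministic
    return of action a.
    """
    p1 = 1.0 / (1.0 + math.exp(-theta))
    probs = [1.0 - p1, p1]
    # Score function: d/dtheta log(1 - p1) = -p1, d/dtheta log(p1) = 1 - p1.
    scores = [-p1, 1.0 - p1]
    estimates = [(rewards[a] - baseline) * scores[a] for a in (0, 1)]
    mean = sum(probs[a] * estimates[a] for a in (0, 1))
    variance = sum(probs[a] * (estimates[a] - mean) ** 2 for a in (0, 1))
    return mean, variance


if __name__ == "__main__":
    rewards = [10.0, 11.0]
    print(score_function_gradient_stats(0.0, rewards, baseline=0.0))
    # Using the expected return under the policy as the baseline.
    print(score_function_gradient_stats(0.0, rewards, baseline=10.5))
```

In this toy setting both calls return the same mean gradient, while the variance drops when the baseline is close to the expected return; an informed baseline aims for the same effect in richer environments.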


The Brain is like a Computer is a Terrible Metaphor

#artificialintelligence

"The brain is a computer" is a damn problematic metaphor. I prefer to say that "the brain is an intuition machine". The term computer is conventionally understood to be a digital computer. It's the kind that is designed by minds and manufactured in assembly lines. It is a horrible metaphor.


Learning Strategies in Decentralized Matching Markets under Uncertain Preferences

arXiv.org Machine Learning

We study two-sided decentralized matching markets in which participants have uncertain preferences. We present a statistical model to learn the preferences. The model incorporates uncertain state and the participants' competition on one side of the market. We derive an optimal strategy that maximizes the agent's expected payoff and calibrate the uncertain state by taking the opportunity costs into account. We discuss the sense in which the matching derived from the proposed strategy has a stability property. We also prove a fairness property that asserts that there exists no justified envy according to the proposed strategy. We provide numerical results to demonstrate the improved payoff, stability and fairness, compared to alternative methods.


Learning to Represent Action Values as a Hypergraph on the Action Vertices

arXiv.org Machine Learning

Action-value estimation is a critical component of many reinforcement learning (RL) methods whereby sample complexity relies heavily on how fast a good estimator for action value can be learned. By viewing this problem through the lens of representation learning, good representations of both state and action can facilitate action-value estimation. While advances in deep learning have seamlessly driven progress in learning state representations, given the specificity of the notion of agency to RL, little attention has been paid to learning action representations. We conjecture that leveraging the combinatorial structure of multi-dimensional action spaces is a key ingredient for learning good representations of action. To test this, we set forth the action hypergraph networks framework---a class of functions for learning action representations with a relational inductive bias. Using this framework we realise an agent class based on a combination with deep Q-networks, which we dub hypergraph Q-networks. We show the effectiveness of our approach on a myriad of domains: illustrative prediction problems under minimal confounding effects, Atari 2600 games, and physical control benchmarks.


Bayesian Algorithms for Decentralized Stochastic Bandits

arXiv.org Machine Learning

We study a decentralized cooperative multi-agent multi-armed bandit problem with $K$ arms and $N$ agents connected over a network. In our model, each arm's reward distribution is the same for all agents, and rewards are drawn independently across agents and over time steps. In each round, agents choose an arm to play and subsequently send a message to their neighbors. The goal is to minimize cumulative regret averaged over the entire network. We propose a decentralized Bayesian multi-armed bandit framework that extends single-agent Bayesian bandit algorithms to the decentralized setting. Specifically, we study an information assimilation algorithm that can be combined with existing Bayesian algorithms, and using this, we propose a decentralized Thompson Sampling algorithm and a decentralized Bayes-UCB algorithm. We analyze the decentralized Thompson Sampling algorithm under Bernoulli rewards and establish a problem-dependent upper bound on the cumulative regret. We show that the regret incurred scales logarithmically over the time horizon with constants that match those of an optimal centralized agent with access to all observations across the network. Our analysis also characterizes the cumulative regret in terms of the network structure. Through extensive numerical studies, we show that our extensions of Thompson Sampling and Bayes-UCB incur lower cumulative regret than the state-of-the-art algorithms inspired by the Upper Confidence Bound algorithm. We implement our proposed decentralized Thompson Sampling under a gossip protocol, and over time-varying networks, where each communication link has a fixed probability of failure.
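The setting above can be sketched in a few lines. The code below is a simplified illustration, not the paper's information assimilation algorithm: it assumes Bernoulli rewards, Beta(1, 1) priors, a fixed undirected graph given as adjacency lists, and that each agent simply adds the (arm, reward) pairs reported by its direct neighbors to its own posterior counts after every round. The function name `decentralized_thompson_sampling` and all parameter names are hypothetical.

```python
import random


def decentralized_thompson_sampling(arm_means, neighbors, horizon, seed=0):
    """Toy decentralized Thompson Sampling with Bernoulli rewards.

    arm_means[k]  : success probability of arm k (shared by all agents)
    neighbors[i]  : list of agents adjacent to agent i in the network
    Assumption: after each round, agent i updates its Beta posterior with its
    own observation and with those of its direct neighbors.
    """
    rng = random.Random(seed)
    n_agents, n_arms = len(neighbors), len(arm_means)
    alpha = [[1] * n_arms for _ in range(n_agents)]  # 1 + observed successes
    beta = [[1] * n_arms for _ in range(n_agents)]   # 1 + observed failures
    pulls = [[0] * n_arms for _ in range(n_agents)]

    for _ in range(horizon):
        messages = []
        for agent in range(n_agents):
            # Sample a mean for every arm from the agent's posterior.
            samples = [
                rng.betavariate(alpha[agent][k], beta[agent][k])
                for k in range(n_arms)
            ]
            arm = max(range(n_arms), key=samples.__getitem__)
            reward = 1 if rng.random() < arm_means[arm] else 0
            pulls[agent][arm] += 1
            messages.append((arm, reward))
        # Communication step: assimilate own and neighbors' observations.
        for agent in range(n_agents):
            for source in [agent] + list(neighbors[agent]):
                arm, reward = messages[source]
                alpha[agent][arm] += reward
                beta[agent][arm] += 1 - reward
    return alpha, beta, pulls


if __name__ == "__main__":
    line_graph = [[1], [0, 2], [1]]  # three agents on a path
    _, _, pulls = decentralized_thompson_sampling([0.2, 0.8], line_graph, 200)
    print(pulls)
```

Agents with more neighbors accumulate observations faster under this scheme, which gives a rough intuition for why regret depends on the network structure.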