Agents
Towards Learning Multi-agent Negotiations via Self-Play
Making sophisticated, robust, and safe sequential decisions is at the heart of intelligent systems. This is especially critical for planning in complex multi-agent environments, where agents need to anticipate other agents' intentions and possible future actions. Traditional methods formulate the problem as a Markov Decision Process, but the solutions often rely on various assumptions and become brittle when presented with corner cases. In contrast, deep reinforcement learning (Deep RL) has been very effective at finding policies by simultaneously exploring, interacting, and learning from environments. Leveraging the powerful Deep RL paradigm, we demonstrate that an iterative procedure of self-play can create progressively more diverse environments, leading to the learning of sophisticated and robust multi-agent policies. We demonstrate this in a challenging multi-agent simulation of merging traffic, where agents must interact and negotiate with others in order to successfully merge on or off the road. While the environment starts off simple, we increase its complexity by iteratively adding an increasingly diverse set of agents to the agent "zoo" as training progresses. Qualitatively, we find that through self-play, our policies automatically learn interesting behaviors such as defensive driving, overtaking, yielding, and the use of signal lights to communicate intentions to other agents. In addition, quantitatively, we show a dramatic improvement of the success rate of merging maneuvers from 63% to over 98%.
Regret Bounds for Decentralized Learning in Cooperative Multi-Agent Dynamical Systems
Asghari, Seyed Mohammad, Ouyang, Yi, Nayyar, Ashutosh
Regret analysis is challenging in Multi-Agent Reinforcement Learning (MARL) primarily due to the dynamical environments and the decentralized information among agents. We attempt to solve this challenge in the context of decentralized learning in multi-agent linear-quadratic (LQ) dynamical systems. We begin with a simple setup consisting of two agents and two dynamically decoupled stochastic linear systems, each system controlled by an agent. The systems are coupled through a quadratic cost function. When both systems' dynamics are unknown and there is no communication among the agents, we show that no learning policy can achieve regret that is sub-linear in $T$, where $T$ is the time horizon. When only one system's dynamics are unknown and there is one-directional communication from the agent controlling the unknown system to the other agent, we propose a MARL algorithm based on the construction of an auxiliary single-agent LQ problem. The auxiliary single-agent problem in the proposed MARL algorithm serves as an implicit coordination mechanism between the two learning agents. This allows the agents to achieve a regret within $O(\sqrt{T})$ of the regret of the auxiliary single-agent problem. Consequently, using existing results for single-agent LQ regret, our algorithm provides a $\tilde{O}(\sqrt{T})$ regret bound. (Here $\tilde{O}(\cdot)$ hides constants and logarithmic factors.) Our numerical experiments indicate that this bound is matched in practice. From the two-agent problem, we extend our results to multi-agent LQ systems with certain communication patterns.
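For orientation, the regret discussed in this abstract can be written in the form commonly used for average-cost LQ problems. This is only a sketch in standard notation: the symbols $c_t$ (the cost incurred at time $t$) and $J^{*}$ (the optimal average cost achievable when the dynamics are known) do not appear in the abstract and are assumptions made here for illustration.

```latex
R(T) = \sum_{t=0}^{T-1} c_t \;-\; T\,J^{*},
\qquad
\text{sub-linear regret:}\quad \lim_{T \to \infty} \frac{R(T)}{T} = 0 .
```

Under this reading, a $\tilde{O}(\sqrt{T})$ bound implies $R(T)/T \to 0$, so the average cost of the learning agents approaches the optimal average cost as the horizon grows.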
COKE: Communication-Censored Kernel Learning for Decentralized Non-parametric Learning
Xu, Ping, Wang, Yue, Chen, Xiang, Zhi, Tian
This paper studies the decentralized optimization and learning problem where multiple interconnected agents aim to learn an optimal decision function defined over a reproducing kernel Hilbert (RKH) space by jointly minimizing a global objective function, with access to locally observed data only. As a non-parametric approach, kernel learning faces a major challenge in distributed implementation: the decision variables of local objective functions are data-dependent with different sizes and thus cannot be optimized under the decentralized consensus framework without any raw data exchange among agents. To circumvent this major challenge and preserve data privacy, we leverage the random feature (RF) approximation approach to map the large-volume data represented in the RKH space into a smaller RF space, which facilitates the same-size parameter exchange and enables distributed agents to reach consensus on the function decided by the parameters in the RF space. For fast convergent implementation, we design an iterative algorithm for Decentralized Kernel Learning via Alternating direction method of multipliers (DKLA). Further, we develop a COmmunication-censored KErnel learning (COKE) algorithm to reduce the communication load in DKLA. To do so, we apply a communication-censoring strategy, which prevents an agent from transmitting at every iteration unless its local updates are deemed informative. Theoretical results in terms of linear convergence guarantee and generalization performance analysis of DKLA and COKE are provided. Comprehensive tests with both synthetic and real datasets are conducted to verify the communication efficiency and learning effectiveness of COKE.
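The two ideas in this abstract, mapping data into a fixed-size random feature space and censoring uninformative transmissions, can be sketched as follows. This is a minimal illustration and not the authors' DKLA or COKE implementation: it assumes a Gaussian (RBF) kernel approximated with standard random Fourier features, and the function names, the shared seed, and the norm-based censoring test with a caller-supplied threshold are all hypothetical choices made for this example.

```python
import numpy as np


def draw_rf_params(dim, n_features, gamma, seed):
    """Draw random Fourier feature parameters for the RBF kernel
    k(x, y) = exp(-gamma * ||x - y||^2).

    Assumption: every agent calls this with the same seed, so all agents
    share one feature map and their parameter vectors have the same size.
    """
    rng = np.random.default_rng(seed)
    # Frequencies follow the kernel's spectral density, N(0, 2 * gamma * I).
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(dim, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return W, b


def rf_map(X, W, b):
    """Map raw data X of shape (n_samples, dim) into the RF space."""
    n_features = W.shape[1]
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)


def should_transmit(theta_new, theta_last_sent, threshold):
    """Hypothetical censoring test: transmit only if the local parameters
    moved by at least `threshold` since the last transmission."""
    return bool(np.linalg.norm(theta_new - theta_last_sent) >= threshold)
```

With the feature map fixed, each agent would only ever exchange a parameter vector of length `n_features`, never raw data, and `should_transmit` decides per iteration whether that exchange happens at all. How the threshold is scheduled over iterations is left open here.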
Reinforcement Learning-based Autoscaling of Workflows in the Cloud: A Survey
Garí, Yisel, Monge, David A., Pacini, Elina, Mateos, Cristian, Garino, Carlos García
Reinforcement Learning (RL) has demonstrated a great potential for automatically solving decision-making problems in complex uncertain environments. Basically, RL proposes a computational approach that allows learning through interaction in an environment of stochastic behavior, with agents taking actions to maximize some cumulative short-term and long-term rewards. Some of the most impressive results have been shown in Game Theory, where agents exhibited super-human performance in games like Go or StarCraft II, which led to its adoption in many other domains including Cloud Computing. In particular, workflow autoscaling exploits the Cloud elasticity to optimize the execution of workflows according to given optimization criteria. This is a decision-making problem in which it is necessary to establish when and how to scale up or down computational resources, and how to assign them to the upcoming processing workload. Such actions have to be taken considering some optimization criteria in the Cloud, a dynamic and uncertain environment. Motivated by this, many works apply RL to the autoscaling problem in the Cloud. In this work we exhaustively survey those proposals from major venues, and uniformly compare them based on a set of proposed taxonomies. We also discuss open problems and provide a perspective on future research in the area.
Uncertainty-based Modulation for Lifelong Learning
Brna, Andrew, Brown, Ryan, Connolly, Patrick, Simons, Stephen, Shimizu, Renee, Aguilar-Simon, Mario
The creation of machine learning algorithms for intelligent agents capable of continuous, lifelong learning is a critical objective for algorithms being deployed on real-life systems in dynamic environments. Here we present an algorithm inspired by neuromodulatory mechanisms in the human brain that integrates and expands upon Stephen Grossberg's ground-breaking Adaptive Resonance Theory proposals. Specifically, it builds on the concept of uncertainty, and employs a series of neuromodulatory mechanisms to enable continuous learning, including self-supervised and one-shot learning. Algorithm components were evaluated in a series of benchmark experiments that demonstrate stable learning without catastrophic forgetting. We also demonstrate the critical role of developing these systems in a closed-loop manner where the environment and the agent's behaviors constrain and guide the learning process. To this end, we integrated the algorithm into an embodied simulated drone agent. The experiments show that the algorithm is capable of continuous learning of new tasks and under changed conditions with high classification accuracy (greater than 94 percent) in a virtual environment, without catastrophic forgetting. The algorithm accepts high dimensional inputs from any state-of-the-art detection and feature extraction algorithms, making it a flexible addition to existing systems. We also describe future development efforts focused on imbuing the algorithm with mechanisms to seek out new knowledge as well as employ a broader range of neuromodulatory processes.
Emergent behavior by minimizing chaos
All living organisms carve out environmental niches within which they can maintain relative predictability amidst the ever-increasing entropy around them (1), (2). Humans, for example, go to great lengths to shield themselves from surprise -- we band together in millions to build cities with homes, supplying water, food, gas, and electricity to control the deterioration of our bodies and living spaces amidst heat and cold, wind and storm. The need to discover and maintain such surprise-free equilibria has driven great resourcefulness and skill in organisms across very diverse natural habitats. Motivated by this, we ask: could the motive of preserving order amidst chaos guide the automatic acquisition of useful behaviors in artificial agents? This central problem in artificial intelligence has evoked several candidate solutions, largely focusing on novelty-seeking behaviors (3), (4), (5).
Facebook AI gives maps the brushoff in helping robots find the way
Facebook has scored an impressive feat involving AI that can navigate without any map. Facebook's wish for bragging rights, although they said they have a way to go, was evident in its blog post, "Near-perfect point-goal navigation from 2.5 billion frames of experience." Long story short, Facebook has delivered an algorithm that, quoting MIT Technology Review, "lets robots find the shortest route in unfamiliar environments, opening the door to robots that can work inside homes and offices." In the same plain-and-simple vein, Ubergizmo's Tyler Lee remarked: "Facebook believes that with this new algorithm, it will be capable of creating robots that can navigate an area without the need for maps...in theory, you could place a robot in a room or an area without a map and it should be able to find its way to its destination." Erik Wijmans and Abhishek Kadian in the Facebook Jan. 21 post said that, well, after all, one of the technology's key challenges is "teaching these systems to navigate through complex, unfamiliar real-world environments to reach a specified destination--without a preprovided map." Facebook has taken on the challenge. The two announced that Facebook AI created a large-scale distributed reinforcement learning algorithm called DD-PPO, "which has effectively solved the task of point-goal navigation using only an RGB-D camera, GPS, and compass data," they wrote. DD-PPO stands for decentralized distributed proximal policy optimization. This is what Facebook is using to train agents, and the results seen in virtual environments such as houses and office buildings were encouraging. The bloggers pointed out that "even failing 1 out of 100 times is not acceptable in the physical world, where a robot agent might damage itself or its surroundings by making an error." Beyond DD-PPO, the authors gave credit to Facebook AI's open source AI Habitat platform for its "state-of-the-art speed and fidelity."
AI Habitat made its open source announcement last year as a simulation platform to train embodied agents such as virtual robots in photo-realistic 3-D environments. Facebook said it was part of "Facebook AI's ongoing effort to create systems that are less reliant on large annotated data sets used for supervised training." InfoQ had said in July that "The technology was taking a different approach than relying upon static data sets which other researchers have traditionally used and that Facebook decided to open-source this technology to move this subfield forward." Jon Fingas in Engadget looked at how the team worked toward AI navigation (and this is where that 2.5 billion number comes in). "Previous projects tend to struggle without massive computational power.
Facebook speeds up AI training by culling the weak – TechCrunch
Training an artificial intelligence agent to do something like navigate a complex 3D world is computationally expensive and time-consuming. In order to better create these potentially useful systems, Facebook engineers derived huge efficiency benefits from, essentially, leaving the slowest of the pack behind. It's part of the company's new focus on "embodied AI," meaning machine learning systems that interact intelligently with their surroundings. That could mean lots of things -- responding to a voice command using conversational context, for instance, but also more subtle things like a robot knowing it has entered the wrong room of a house. Exactly why Facebook is so interested in that I'll leave to your own speculation, but the fact is they've recruited and funded serious researchers to look into this and related domains of AI work.
Silly rules improve the capacity of agents to learn stable enforcement and compliance behaviors
Köster, Raphael, Hadfield-Menell, Dylan, Hadfield, Gillian K., Leibo, Joel Z.
How can societies learn to enforce and comply with social norms? Here we investigate the learning dynamics and emergence of compliance and enforcement of social norms in a foraging game, implemented in a multi-agent reinforcement learning setting. In this spatiotemporally extended game, individuals are incentivized to implement complex berry-foraging policies and punish transgressions against social taboos covering specific berry types. We show that agents benefit when eating poisonous berries is taboo, meaning the behavior is punished by other agents, as this helps overcome a credit-assignment problem in discovering delayed health effects. Critically, however, we also show that introducing an additional taboo, which results in punishment for eating a harmless berry, improves the rate and stability with which agents learn to punish taboo violations and comply with taboos. Counterintuitively, our results show that an arbitrary taboo (a "silly rule") can enhance social learning dynamics and achieve better outcomes in the middle stages of learning. We discuss the results in the context of studying normativity as a group-level emergent phenomenon.
Learning Non-Markovian Reward Models in MDPs
Rens, Gavin, Raskin, Jean-François
There are situations in which an agent should receive rewards only after having accomplished a series of previous tasks. In other words, the reward that the agent receives is non-Markovian. One natural and quite general way to represent history-dependent rewards is via a Mealy machine: a finite state automaton that produces output sequences (rewards in our case) from input sequences (state/action observations in our case). In our formal setting, we consider a Markov decision process (MDP) that models the dynamics of the environment in which the agent evolves and a Mealy machine synchronised with this MDP to formalise the non-Markovian reward function. While the MDP is known to the agent, the reward function is unknown to the agent and must be learnt. Learning non-Markovian reward functions is a challenge. Our approach to overcoming this challenge is a careful combination of Angluin's L* active learning algorithm to learn finite automata, testing techniques for establishing conformance of finite model hypotheses, and optimisation techniques for computing optimal strategies in Markovian (immediate) reward MDPs. We also show how our framework can be combined with classical heuristics such as Monte Carlo Tree Search. We illustrate our algorithms and a preliminary implementation on two typical examples for AI.
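To make the Mealy-machine representation of a history-dependent reward concrete, here is a minimal sketch. It is not taken from the paper: the class name, the "key then door" task, and the use of plain string labels in place of state/action observations are all hypothetical simplifications chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class MealyRewardMachine:
    """Finite-state machine that emits one reward per input observation."""
    initial: str
    # (machine_state, observation) -> (next_machine_state, reward)
    transitions: dict

    def run(self, observations):
        """Return the reward sequence produced by an observation sequence."""
        state = self.initial
        rewards = []
        for obs in observations:
            state, reward = self.transitions[(state, obs)]
            rewards.append(reward)
        return rewards


# Hypothetical task: reaching the door pays off only after picking up a key.
key_then_door = MealyRewardMachine(
    initial="no_key",
    transitions={
        ("no_key", "key"): ("has_key", 0),
        ("no_key", "door"): ("no_key", 0),
        ("no_key", "other"): ("no_key", 0),
        ("has_key", "key"): ("has_key", 0),
        ("has_key", "door"): ("no_key", 1),
        ("has_key", "other"): ("has_key", 0),
    },
)
```

The same observation, `"door"`, yields reward 0 or 1 depending on what came before it, which is exactly what a Markovian (immediate) reward function over observations cannot express. For example, `key_then_door.run(["door", "key", "other", "door"])` returns `[0, 0, 0, 1]`.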