Agents
The NetHack Learning Environment
Küttler, Heinrich, Nardelli, Nantas, Miller, Alexander H., Raileanu, Roberta, Selvatici, Marco, Grefenstette, Edward, Rocktäschel, Tim
Progress in Reinforcement Learning (RL) algorithms goes hand-in-hand with the development of challenging environments that test the limits of current methods. While existing RL environments are either sufficiently complex or based on fast simulation, they are rarely both. Here, we present the NetHack Learning Environment (NLE), a scalable, procedurally generated, stochastic, rich, and challenging environment for RL research based on the popular single-player terminal-based roguelike game, NetHack. We argue that NetHack is sufficiently complex to drive long-term research on problems such as exploration, planning, skill acquisition, and language-conditioned RL, while dramatically reducing the computational resources required to gather a large amount of experience. We compare NLE and its task suite to existing alternatives, and discuss why it is an ideal medium for testing the robustness and systematic generalization of RL agents. We demonstrate empirical success for early stages of the game using a distributed Deep RL baseline and Random Network Distillation exploration, alongside qualitative analysis of various agents trained in the environment. NLE is open source at https://github.com/facebookresearch/nle.
Preferences Single-Peaked on a Circle
Peters, Dominik | Lackner, Martin (TU Wien)
We introduce the domain of preferences that are single-peaked on a circle, which is a generalization of the well-studied single-peaked domain. This preference restriction is useful, e.g., for scheduling decisions, certain facility location problems, and for one-dimensional decisions in the presence of extremist preferences. We give a fast recognition algorithm of this domain, provide a characterisation by finitely many forbidden subprofiles, and show that many popular single- and multi-winner voting rules are polynomial-time computable on this domain. In particular, we prove that Proportional Approval Voting can be computed in polynomial time for profiles that are single-peaked on a circle. In contrast, Kemeny's rule remains hard to evaluate, and several impossibility results from social choice theory can be proved using only profiles in this domain.
Towards Understanding Linear Value Decomposition in Cooperative Multi-Agent Q-Learning
Wang, Jianhao, Ren, Zhizhou, Han, Beining, Zhang, Chongjie
Value decomposition is a popular and promising approach to scaling up multi-agent reinforcement learning in cooperative settings. However, the theoretical understanding of such methods is limited. In this paper, we introduce a variant of the fitted Q-iteration framework for analyzing multi-agent Q-learning with value decomposition. Based on this framework, we derive a closed-form solution to the Bellman error minimization with linear value decomposition. With this novel solution, we further reveal two interesting insights: 1) linear value decomposition implicitly implements a classical multi-agent credit assignment called counterfactual difference rewards; and 2) multi-agent Q-learning with linear value decomposition requires on-policy data distribution to achieve numerical stability. In the empirical study, our experiments demonstrate the realizability of our theoretical implications in a broad set of complicated tasks. They show that most state-of-the-art deep multi-agent Q-learning algorithms using linear value decomposition cannot efficiently utilize off-policy samples, which may even lead to an unbounded divergence.
Envy-freeness up to one item: Shall we add or remove resources?
We consider a fair division model in which agents have general valuations for bundles of indivisible items. We propose two new axiomatic properties for allocations in this model: EF1+- and EFX+-. We compare these with the existing EF1 and EFX. Although EF1 and EF1+- allocations often exist, our results assert eloquently that EFX+- and PO allocations exist in each case where EFX and PO allocations do not exist. Additionally, we prove several new impossibility and incompatibility results.
Towards Minimax Optimal Reinforcement Learning in Factored Markov Decision Processes
Tian, Yi, Qian, Jian, Sra, Suvrit
We study minimax optimal reinforcement learning in episodic factored Markov decision processes (FMDPs), which are MDPs with conditionally independent transition components. Assuming the factorization is known, we propose two model-based algorithms. The first one achieves minimax optimal regret guarantees for a rich class of factored structures, while the second one enjoys better computational complexity with a slightly worse regret. A key new ingredient of our algorithms is the design of a bonus term to guide exploration. We complement our algorithms by presenting several structure-dependent lower bounds on regret for FMDPs that reveal the difficulty hiding in the intricacy of the structures.
Fanoos: Multi-Resolution, Multi-Strength, Interactive Explanations for Learned Systems
Machine learning becomes increasingly important to tune or even synthesize the behavior of safety-critical components in highly non-trivial environments, where the inability to understand learned components in general, and neural nets in particular, poses serious obstacles to their adoption. Explainability and interpretability methods for learned systems have gained considerable academic attention, but the focus of current approaches on only one aspect of explanation, at a fixed level of abstraction, and limited if any formal guarantees, prevents those explanations from being digestible by the relevant stakeholders (e.g., end users, certification authorities, engineers) with their diverse backgrounds and situation-specific needs. We introduce Fanoos, a flexible framework for combining formal verification techniques, heuristic search, and user interaction to explore explanations at the desired level of granularity and fidelity. We demonstrate the ability of Fanoos to produce and adjust the abstractness of explanations in response to user requests on a learned controller for an inverted double pendulum and on a learned CPU usage model.
Evolutionary Processes in Quantum Decision Theory
In recent years, there has appeared high interest to the possibility of formulating decision theory in the language of quantum mechanics. Numerous references on this topic can be found in the books [1-4] and review articles [5-8]. This interest is caused by the inability of classical decision theory [9] to comply with the behaviour of real decision makers, which requires to develop other approaches. Resorting to the techniques of quantum theory gives hopes for a better representation of behavioral decision making. There are several variants of using quantum mechanics for interpreting conscious effects.
Emergent cooperation through mutual information maximization
Cuervo, Santiago, Alzate, Marco
With artificial intelligence systems becoming ubiquitous in our society, its designers will soon have to start to consider its social dimension, as many of these systems will have to interact among them to work efficiently. With this in mind, we propose a decentralized deep reinforcement learning algorithm for the design of cooperative multi-agent systems. The algorithm is based on the hypothesis that highly correlated actions are a feature of cooperative systems, and hence, we propose the insertion of an auxiliary objective of maximization of the mutual information between the actions of agents in the learning problem. Our system is applied to a social dilemma, a problem whose optimal solution requires that agents cooperate to maximize a macroscopic performance function despite the divergent individual objectives of each agent. By comparing the performance of the proposed system to a system without the auxiliary objective, we conclude that the maximization of mutual information among agents promotes the emergence of cooperation in social dilemmas.
Breaking the Curse of Many Agents: Provable Mean Embedding Q-Iteration for Mean-Field Reinforcement Learning
Wang, Lingxiao, Yang, Zhuoran, Wang, Zhaoran
Multi-agent reinforcement learning (MARL) achieves significant empirical successes. However, MARL suffers from the curse of many agents. In this paper, we exploit the symmetry of agents in MARL. In the most generic form, we study a mean-field MARL problem. Such a mean-field MARL is defined on mean-field states, which are distributions that are supported on continuous space. Based on the mean embedding of the distributions, we propose MF-FQI algorithm that solves the mean-field MARL and establishes a non-asymptotic analysis for MF-FQI algorithm. We highlight that MF-FQI algorithm enjoys a "blessing of many agents" property in the sense that a larger number of observed agents improves the performance of MF-FQI algorithm.
A Primer on Zeroth-Order Optimization in Signal Processing and Machine Learning
Liu, Sijia, Chen, Pin-Yu, Kailkhura, Bhavya, Zhang, Gaoyuan, Hero, Alfred, Varshney, Pramod K.
Zeroth-order (ZO) optimization is a subset of gradient-free optimization that emerges in many signal processing and machine learning applications. It is used for solving optimization problems similarly to gradient-based methods. However, it does not require the gradient, using only function evaluations. Specifically, ZO optimization iteratively performs three major steps: gradient estimation, descent direction computation, and solution update. In this paper, we provide a comprehensive review of ZO optimization, with an emphasis on showing the underlying intuition, optimization principles and recent advances in convergence analysis. Moreover, we demonstrate promising applications of ZO optimization, such as evaluating robustness and generating explanations from black-box deep learning models, and efficient online sensor management.