Despite the fact that deep reinforcement learning (RL) has surpassed human-level performances in various tasks, it still has several fundamental challenges such as extensive data requirement and lack of interpretability. We investigate the RL problem with non-Markovian reward functions to address such challenges. We enable an RL agent to extract high-level knowledge in the form of finite reward automata, a type of Mealy machines that encode non-Markovian reward functions. The finite reward automata can be converted to deterministic finite state machines, which can be further translated to regular expressions. Thus, this representation is more interpretable than other forms of knowledge representation such as neural networks. We propose an active learning approach that iteratively infers finite reward automata and performs RL (specifically, q-learning) based on the inferred finite reward automata. The inference method is inspired by the L* learning algorithm, and modified in the framework of RL. We maintain two different q-functions, one for answering the membership queries in the L* learning algorithm and the other one for obtaining optimal policies for the inferred finite reward automaton. The experiments show that the proposed approach converges to optimal policies in at most 50% of the training steps as in the two state-of-the-art baselines.
With artificial intelligence systems becoming ubiquitous in our society, its designers will soon have to start to consider its social dimension, as many of these systems will have to interact among them to work efficiently. With this in mind, we propose a decentralized deep reinforcement learning algorithm for the design of cooperative multi-agent systems. The algorithm is based on the hypothesis that highly correlated actions are a feature of cooperative systems, and hence, we propose the insertion of an auxiliary objective of maximization of the mutual information between the actions of agents in the learning problem. Our system is applied to a social dilemma, a problem whose optimal solution requires that agents cooperate to maximize a macroscopic performance function despite the divergent individual objectives of each agent. By comparing the performance of the proposed system to a system without the auxiliary objective, we conclude that the maximization of mutual information among agents promotes the emergence of cooperation in social dilemmas.
Computers have been used to analyze and create music since they were first introduced in the 1950s and 1960s. Beginning in the late 1990s, the rise of the Internet and large scale platforms for music recommendation and retrieval have made music an increasingly prevalent domain of machine learning and artificial intelligence research. While still nascent, several different approaches have been employed to tackle what may broadly be referred to as "musical intelligence." This article provides a definition of musical intelligence, introduces a taxonomy of its constituent components, and surveys the wide range of AI methods that can be, and have been, brought to bear in its pursuit, with a particular emphasis on machine learning methods.
We present a framework to address a class of sequential decision making problems. Our framework features learning the optimal control policy with robustness to noisy data, determining the unknown state and action parameters, and performing sensitivity analysis with respect to problem parameters. We consider two broad categories of sequential decision making problems modelled as infinite horizon Markov Decision Processes (MDPs) with (and without) an absorbing state. The central idea underlying our framework is to quantify exploration in terms of the Shannon Entropy of the trajectories under the MDP and determine the stochastic policy that maximizes it while guaranteeing a low value of the expected cost along a trajectory. This resulting policy enhances the quality of exploration early on in the learning process, and consequently allows faster convergence rates and robust solutions even in the presence of noisy data as demonstrated in our comparisons to popular algorithms such as Q-learning, Double Q-learning and entropy regularized Soft Q-learning. The framework extends to the class of parameterized MDP and RL problems, where states and actions are parameter dependent, and the objective is to determine the optimal parameters along with the corresponding optimal policy. Here, the associated cost function can possibly be non-convex with multiple poor local minima. Simulation results applied to a 5G small cell network problem demonstrate successful determination of communication routes and the small cell locations. We also obtain sensitivity measures to problem parameters and robustness to noisy environment data.
In this paper, an online evolving framework is proposed to detect and revise a controller's imperfect decision-making in advance. The framework consists of three modules: the evolving Finite State Machine (e-FSM), action-reviser, and controller modules. The e-FSM module evolves a stochastic model (e.g., Discrete-Time Markov Chain) from scratch by determining new states and identifying transition probabilities repeatedly. With the latest stochastic model and given criteria, the action-reviser module checks validity of the controller's chosen action by predicting future states. Then, if the chosen action is not appropriate, another action is inspected and selected. In order to show the advantage of the proposed framework, the Deep Deterministic Policy Gradient (DDPG) w/ and w/o the online evolving framework are applied to control an ego-vehicle in the car-following scenario where control criteria are set by speed and safety. Experimental results show that inappropriate actions chosen by the DDPG controller are detected and revised appropriately through our proposed framework, resulting in no control failures after a few iterations.
Text-based games are long puzzles or quests, characterized by a sequence of sparse and potentially deceptive rewards. They provide an ideal platform to develop agents that perceive and act upon the world using a combinatorially sized natural language state-action space. Standard Reinforcement Learning agents are poorly equipped to effectively explore such spaces and often struggle to overcome bottlenecks---states that agents are unable to pass through simply because they do not see the right action sequence enough times to be sufficiently reinforced. We introduce Q*BERT, an agent that learns to build a knowledge graph of the world by answering questions, which leads to greater sample efficiency. To overcome bottlenecks, we further introduce MC!Q*BERT an agent that uses an knowledge-graph-based intrinsic motivation to detect bottlenecks and a novel exploration strategy to efficiently learn a chain of policy modules to overcome them. We present an ablation study and results demonstrating how our method outperforms the current state-of-the-art on nine text games, including the popular game, Zork, where, for the first time, a learning agent gets past the bottleneck where the player is eaten by a Grue.
We take two specific approaches - first Value function approximation in Reinforcement Learning is to represent the lifted Q-value functions and the second (RL) has long been viewed using the lens of feature discovery is to represent the Bellman residuals - both using a set of (Parr et al. 2007). A set of classical approaches relational regression trees (RRTs) (Blockeel and De Raedt for this problem based on Approximate Dynamic Programming 1998). A key aspect of our approach is that it is model-free, (ADP) is the fitted value iteration algorithm (Boyan which most of the RMDP algorithms assume. The only exception and Moore 1995; Ernst, Geurts, and Wehenkel 2005; Riedmiller is Fern et al. (2006), who directly learn in policy 2005), a batch mode approximation scheme that employs space. Our work differs from their work in that we directly function approximators in each iteration to represent learn value functions and eventually policies from them the value estimates. Another popular class of methods that and adapt the most recently successful relational gradientboosting address this problem is Bellman error based methods (Menache, (RFGB) (Natarajan et al. 2014), which has been Mannor, and Shimkin 2005; Keller, Mannor, and Precup shown to outperform learning relational rules one by one.
Recently, for continuous control tasks, reinforcement learning (RL) algorithms based on actor-critic architecture  or policy optimization  have shown remarkably good performance. The policy and the value function are represented by deep neural networks and then the weights are updated accordingly. However,  shows that the performance of these RL algorithms vary a lot with changes in hyperparameters, network architecture etc. Furthermore,  showed that a simple linear policy-based method with weights updated by a random search method can outperform some of these state-of-the-art results. A key question is how far we can go by relying almost exclusively on these architectural biases. For Markov decision processes (MDPs) with discrete state and action spaces, model-based algorithms based on dynamic programming (DP) ideas  can be used when the model is known. Unfortunately, in many problems (e.g., robotics), the system model is unknown, or simply too complicated to be succinctly stated and used in DP algorithms. Usually, latter is the more likely case.
Real world multi-agent tasks often involve varying types and quantities of agents and non-agent entities. Agents frequently do not know a priori how many other agents and non-agent entities they will need to interact with in order to complete a given task, requiring agents to generalize across a combinatorial number of task configurations with each potentially requiring different strategies. In this work, we tackle the problem of multi-agent reinforcement learning (MARL) in such dynamic scenarios. We hypothesize that, while the optimal behaviors in these scenarios with varying quantities and types of agents/entities are diverse, they may share common patterns within sub-teams of agents that are combined to form team behavior. As such, we propose a method that can learn these subgroup relationships and how they can be combined, ultimately improving knowledge sharing and generalization across scenarios. This method, Attentive-Imaginative QMIX, extends QMIX for dynamic MARL in two ways: 1) an attention mechanism that enables model sharing across variable sized scenarios and 2) a training objective that improves learning across scenarios with varying combinations of agent/entity types by factoring the value function into imagined sub-scenarios. We validate our approach on both a novel grid-world task as well as a version of the StarCraft Multi-Agent Challenge  minimally modified for the dynamic scenario setting.
Active inference offers a first principle account of sentient behaviour, from which special and important cases can be derived, e.g., reinforcement learning, active learning, Bayes optimal inference, Bayes optimal design, etc. Active inference resolves the exploitation-exploration dilemma in relation to prior preferences, by placing information gain on the same footing as reward or value. In brief, active inference replaces value functions with functionals of (Bayesian) beliefs, in the form of an expected (variational) free energy. In this paper, we consider a sophisticated kind of active inference, using a recursive form of expected free energy. Sophistication describes the degree to which an agent has beliefs about beliefs. We consider agents with beliefs about the counterfactual consequences of action for states of affairs and beliefs about those latent states. In other words, we move from simply considering beliefs about "what would happen if I did that" to "what would I believe about what would happen if I did that". The recursive form of the free energy functional effectively implements a deep tree search over actions and outcomes in the future. Crucially, this search is over sequences of belief states, as opposed to states per se. We illustrate the competence of this scheme, using numerical simulations of deep decision problems.