Goto

Collaborating Authors

 Reinforcement Learning


Modelling User's Theory of AI's Mind in Interactive Intelligent Systems

arXiv.org Machine Learning

Many interactive intelligent systems, such as recommendation and information retrieval systems, treat users as a passive data source. Yet, users form mental models of systems and instead of passively providing feedback to the queries of the system, they will strategically plan their actions within the constraints of the mental model to steer the system and achieve their goals faster. We propose to explicitly account for the user's theory of the AI's mind in the user model: the intelligent system has a model of the user having a model of the intelligent system. We study a case where the system is a contextual bandit and the user model is a Markov decision process that plans based on a simpler model of the bandit. Inference in the model can be reduced to probabilistic inverse reinforcement learning, with the nested bandit model defining the transition dynamics, and is implemented using probabilistic programming. Our results show that improved performance is achieved if users can form accurate mental models that the system can capture, implying predictability of the interactive intelligent system is important not only for the user experience but also for the design of the system's statistical models.


Optimal and Low-Complexity Dynamic Spectrum Access for RF-Powered Ambient Backscatter System with Online Reinforcement Learning

arXiv.org Artificial Intelligence

Ambient backscatter has been introduced with a wide range of applications for low power wireless communications. In this article, we propose an optimal and low-complexity dynamic spectrum access framework for RFpowered ambient backscatter system. Under the dynamics of the ambient signals, we first adopt the Markov decision process (MDP) framework to obtain the optimal policy for the secondary transmitter, aiming to maximize the system throughput. However, the MDP-based optimization requires complete knowledge of environment parameters, e.g., the probability of a channel to be idle and the probability of a successful packet transmission, that may not be practical to obtain. To cope with such incomplete knowledge of the environment, we develop a low-complexity online reinforcement learning algorithm that allows the secondary transmitter to "learn" from its decisions and then attain the optimal policy. Simulation results show that the proposed learning algorithm not only efficiently deals with the dynamics of the environment, but also improves the average throughput up to 50% and reduces the blocking probability and delay up to 80% compared with conventional methods. Dynamic spectrum access (DSA) has been considered as a promising solution to improve the utilization of radio spectrum [2]. As DSA standard frameworks, the Federal Communications Commission and the European Telecommunications Standardization Institute have recently proposed Spectrum Access Systems (SAS) and Licensed Shared Access (LSA) respectively [3]. In both SAS and LSA, spectrum users are prioritized at different levels/tiers (e.g., there are three types of users with a decreasing order of priority: Incumbent Users (IUs), Priority Access Licensees (PALs), and General Authorized Access (GAAs)). Without loss of generality, in this work, we refer users with higher priority as IUs and users with lower priority as secondary users (SUs). DSA harvests under-utilized spectrum chunks by allowing an SU to dynamically access (temporarily) idle spectrum bands/whitespaces to transmit data.


Unity: A General Platform for Intelligent Agents

arXiv.org Machine Learning

Recent advances in Deep Reinforcement Learning and Robotics have been driven by the presence of increasingly realistic and complex simulation environments. Many of the existing platforms, however, provide either unrealistic visuals, inaccurate physics, low task complexity, or a limited capacity for interaction among artificial agents. Furthermore, many platforms lack the ability to flexibly configure the simulation, hence turning the simulation environment into a black-box from the perspective of the learning system. Here we describe a new open source toolkit for creating and interacting with simulation environments using the Unity platform: Unity ML-Agents Toolkit. By taking advantage of Unity as a simulation platform, the toolkit enables the development of learning environments which are rich in sensory and physical complexity, provide compelling cognitive challenges, and support dynamic multi-agent interaction. We detail the platform design, communication protocol, set of example environments, and variety of training scenarios made possible via the toolkit.


Learning Invariances for Policy Generalization

arXiv.org Artificial Intelligence

The grey rectangle starts on the left of the screen and can be moved with two actions, "Right" and "Jump". The goal of this game is to reach the right of the screen while avoiding the white obstacle. There is only one specific distance (measured in number of pixels) to the obstacle where the agent has to chose the action "Jump" in order to pass over the obstacle. If jumping is chosen at any other point, the agent will inevitably crash into the obstacle. A reward of 1 is granted anytime the agent moves one pixel to the right (even in the air). The episode terminates if the agent reaches the right of the screen or touches the obstacle. We build a set of related tasks by varying two factors: the floor height and the position of the obstacle on the floor. The resulting set contains 1271 tasks. We use 6 of those for training and evaluate the generalization performance as the fraction of the remaining 1265 tasks the agent can solve.


Improving On-policy Learning with Statistical Reward Accumulation

arXiv.org Artificial Intelligence

Deep reinforcement learning has obtained significant breakthroughs in recent years. Most methods in deep-RL achieve good results via the maximization of the reward signal provided by the environment, typically in the form of discounted cumulative returns. Such reward signals represent the immediate feedback of a particular action performed by an agent. However, tasks with sparse reward signals are still challenging to on-policy methods. In this paper, we introduce an effective characterization of past reward statistics (which can be seen as long-term feedback signals) to supplement this immediate reward feedback. In particular, value functions are learned with multi-critics supervision, enabling complex value functions to be more easily approximated in on-policy learning, even when the reward signals are sparse. We also introduce a novel exploration mechanism called "hot-wiring" that can give a boost to seemingly trapped agents. We demonstrate the effectiveness of our advantage actor multi-critic (A2MC) method across the discrete domains in Atari games as well as continuous domains in the MuJoCo environments. A video demo is provided at https://youtu.be/zBmpf3Yz8tc.


Hierarchically Structured Reinforcement Learning for Topically Coherent Visual Story Generation

arXiv.org Artificial Intelligence

We propose a hierarchically structured reinforcement learning approach to address the challenges of planning for generating coherentmulti-sentence stories for the visual storytelling task. Within our framework, the task of generating a story given a sequence of images is divided across a two-level hierarchical decoder.The high-level decoder constructs a plan by generating a semantic concept (i.e., topic) for each image in sequence. The low-level decoder generates a sentence for each image using a semantic compositional network, which effectively grounds the sentence generation conditioned on the topic. The two decoders are jointly trained end-to-end using reinforcement learning. We evaluate our model on the visual storytelling (VIST) dataset. Empirical results from both automatic and human evaluations demonstrate that the proposed hierarchicallystructured reinforced training achieves significantly better performance compared to a strong flat deep reinforcement learning baseline.


Learn What Not to Learn: Action Elimination with Deep Reinforcement Learning

arXiv.org Machine Learning

Learning how to act when there are many available actions in each state is a challenging task for Reinforcement Learning (RL) agents, especially when many of the actions are redundant or irrelevant. In such cases, it is sometimes easier to learn which actions not to take. In this work, we propose the Action-Elimination Deep Q-Network (AE-DQN) architecture that combines a Deep RL algorithm with an Action Elimination Network (AEN) that eliminates sub-optimal actions. The AEN is trained to predict invalid actions, supervised by an external elimination signal provided by the environment. Simulations demonstrate a considerable speedup and added robustness over vanilla DQN in text-based games with over a thousand discrete actions.


Sample-Efficient Imitation Learning via Generative Adversarial Nets

arXiv.org Machine Learning

Recent work in imitation learning articulate their formulation around the GAIL architecture, relying on the adversarial training procedure introduced in GANs. Albeit successful at generating behaviours similar to those demonstrated to the agent, GAIL suffers from a high sample complexity in the number of interactions it has to carry out in the environment in order to achieve satisfactory performance. In this work, we dramatically shrink the amount of interactions with the environment by leveraging an off-policy actor-critic architecture. Additionally, employing deterministic policy gradients allows us to treat the learned reward as a differentiable node in the computational graph, while preserving the model-free nature of our approach. Our experiments span a variety of continuous control tasks.


Model-Based Stabilisation of Deep Reinforcement Learning

arXiv.org Machine Learning

Though successful in high-dimensional domains, deep reinforcement learning exhibits high sample complexity and suffers from stability issues as reported by researchers and practitioners in the field. These problems hinder the application of such algorithms in real-world and safety-critical scenarios. In this paper, we take steps towards stable and efficient reinforcement learning by following a model-based approach that is known to reduce agent-environment interactions. Namely, our method augments deep Q-networks (DQNs) with model predictions for transitions, rewards, and termination flags. Having the model at hand, we then conduct a rigorous theoretical study of our algorithm and show, for the first time, convergence to a stationary point. En route, we provide a counter-example showing that 'vanilla' DQNs can diverge confirming practitioners' and researchers' experiences. Our proof is novel in its own right and can be extended to other forms of deep reinforcement learning. In particular, we believe exploiting the relation between reinforcement (with deep function approximators) and online learning can serve as a recipe for future proofs in the domain. Finally, we validate our theoretical results in 20 games from the Atari benchmark. Our results show that following the proposed model-based learning approach not only ensures convergence but leads to a reduction in sample complexity and superior performance.


Learning Optimal Fair Policies

arXiv.org Machine Learning

We consider the problem of learning optimal policies from observational data in a way that satisfies certain fairness criteria. The issue of fairness arises where some covariates used in decision making are sensitive features, or are correlated with sensitive features. (Nabi and Shpitser 2018) formalized fairness in the context of regression problems as constraining the causal effects of sensitive features along certain disallowed causal pathways. The existence of these causal effects may be called retrospective unfairness in the sense of already being present in the data before analysis begins, and may be due to discriminatory practices or the biased way in which variables are defined or recorded. In the context of learning policies, what we call prospective bias, i.e., the inappropriate dependence of learned policies on sensitive features, is also possible. In this paper, we use methods from causal and semiparametric inference to learn optimal policies in a way that addresses both retrospective bias in the data, and prospective bias due to the policy. In addition, our methods appropriately address statistical bias due to model misspecification and confounding bias, which are important in the estimation of path-specific causal effects from observational data. We apply our methods to both synthetic data and real criminal justice data.