 Reinforcement Learning


A Benchmark Dataset for Learning to Intervene in Online Hate Speech

arXiv.org Artificial Intelligence

Countering online hate speech is a critical yet challenging task, but one which can be aided by the use of Natural Language Processing (NLP) techniques. Previous research has primarily focused on the development of NLP methods to automatically and effectively detect online hate speech while disregarding further action needed to calm and discourage individuals from using hate speech in the future. In addition, most existing hate speech datasets treat each post as an isolated instance, ignoring the conversational context. In this paper, we propose a novel task of generative hate speech intervention, where the goal is to automatically generate responses to intervene during online conversations that contain hate speech. As a part of this work, we introduce two fully-labeled large-scale hate speech intervention datasets collected from Gab and Reddit. These datasets provide conversation segments, hate speech labels, as well as intervention responses written by Mechanical Turk Workers. In this paper, we also analyze the datasets to understand the common intervention strategies and explore the performance of common automatic response generation methods on these new datasets to provide a benchmark for future research.


Signal Instructed Coordination in Team Competition

arXiv.org Artificial Intelligence

Most existing models of multi-agent reinforcement learning (MARL) adopt the centralized training with decentralized execution framework. We demonstrate that the decentralized execution scheme restricts agents' capacity to find a better joint policy in team competition games, where each team of agents shares common rewards and cooperates to compete against other teams. To resolve this problem, we propose Signal Instructed Coordination (SIC), a novel coordination module that can be integrated with most existing models. SIC casts a common signal sampled from a pre-defined distribution to team members, and adopts an information-theoretic regularization to encourage agents to exploit the instruction of centralized signals during learning. Our experiments show that SIC can consistently improve team performance over well-recognized MARL models on matrix games and predator-prey games.


An Efficient Algorithm for Multiple-Pursuer-Multiple-Evader Pursuit/Evasion Game

arXiv.org Artificial Intelligence

We present a method for pursuit/evasion that is highly efficient and scales to large teams of aircraft. The underlying algorithm is an efficient algorithm for solving Markov Decision Processes (MDPs) that supports fully continuous state spaces. We demonstrate the algorithm in a team pursuit/evasion setting in a 3D environment using a pseudo-6DOF model and study performance by varying the team sizes. We show that as the number of aircraft in the simulation grows, computational performance remains efficient and is suitable for real-time systems. We also define probability-to-win and survivability metrics that describe the teams' performance over multiple trials, and show that the algorithm performs consistently. We provide numerical results showing control inputs for a typical 1v1 encounter and provide videos for 1v1, 2v2, 3v3, 4v4, and 10v10 contests to demonstrate the ability of the algorithm to adapt seamlessly to complex environments.


Option Encoder: A Framework for Discovering a Policy Basis in Reinforcement Learning

arXiv.org Artificial Intelligence

Option discovery and skill acquisition frameworks are integral to the functioning of a hierarchically organized reinforcement learning agent. However, such techniques often yield a large number of options or skills, which can potentially be represented succinctly by filtering out any redundant information. Such a reduction can lower the required computation while also improving the performance on a target task. In order to compress an array of option policies, we attempt to find a policy basis that accurately captures the set of all options. In this work, we propose Option Encoder, an auto-encoder based framework with intelligently constrained weights, that helps discover a collection of basis policies. The policy basis can be used as a proxy for the original set of skills in a suitable hierarchically organized framework. We demonstrate the efficacy of our method on a collection of grid-worlds and on the high-dimensional Fetch-Reach robotic manipulation task by evaluating the obtained policy basis on a set of downstream tasks.


Gradient-Aware Model-based Policy Search

arXiv.org Artificial Intelligence

Traditional model-based reinforcement learning approaches learn a model of the environment dynamics without explicitly considering how it will be used by the agent. In the presence of misspecified model classes, this can lead to poor estimates, as some relevant available information is ignored. In this paper, we introduce a novel model-based policy search approach that exploits the knowledge of the current agent policy to learn an approximate transition model, focusing on the portions of the environment that are most relevant for policy improvement. We leverage a weighting scheme, derived from the minimization of the error on the model-based policy gradient estimator, in order to define a suitable objective function that is optimized for learning the approximate transition model. Then, we integrate this procedure into a batch policy improvement algorithm, named Gradient-Aware Model-based Policy Search (GAMPS), which iteratively learns a transition model and uses it, together with the collected trajectories, to compute the new policy parameters. Finally, we empirically validate GAMPS on benchmark domains, analyzing and discussing its properties.


Policy Space Identification in Configurable Environments

arXiv.org Artificial Intelligence

We study the problem of identifying the policy space of a learning agent, having access to a set of demonstrations generated by its optimal policy. We introduce an approach based on statistical testing to identify the set of policy parameters the agent can control, within a larger parametric policy space. After presenting two identification rules (combinatorial and simplified), applicable under different assumptions on the policy space, we provide a probabilistic analysis of the simplified one in the case of linear policies belonging to the exponential family. To improve the performance of our identification rules, we frame the problem in the recently introduced framework of Configurable Markov Decision Processes, exploiting the opportunity of configuring the environment to induce the agent to reveal which parameters it can control. Finally, we provide an empirical evaluation, on both discrete and continuous domains, to demonstrate the effectiveness of our identification rules.


Fixed-Horizon Temporal Difference Methods for Stable Reinforcement Learning

arXiv.org Artificial Intelligence

We explore fixed-horizon temporal difference (TD) methods, reinforcement learning algorithms for a new kind of value function that predicts the sum of rewards over a $\textit{fixed}$ number of future time steps. To learn the value function for horizon $h$, these algorithms bootstrap from the value function for horizon $h-1$, or some shorter horizon. Because no value function bootstraps from itself, fixed-horizon methods are immune to the stability problems that plague other off-policy TD methods using function approximation (also known as "the deadly triad"). Although fixed-horizon methods require the storage of additional value functions, this gives the agent additional predictive power, while the added complexity can be substantially reduced via parallel updates, shared weights, and $n$-step bootstrapping. We show how to use fixed-horizon value functions to solve reinforcement learning problems competitively with methods such as Q-learning that learn conventional value functions. We also prove convergence of fixed-horizon temporal difference methods with linear and general function approximation. Taken together, our results establish fixed-horizon TD methods as a viable new way of avoiding the stability problems of the deadly triad.
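The bootstrapping pattern described in this abstract, where the value function for horizon $h$ learns from the one for horizon $h-1$, can be illustrated with a minimal tabular sketch. This is not the authors' implementation: the function name `fixed_horizon_td`, the transition-tuple format, and the three-state chain used in the usage lines are all assumptions made for illustration. The horizon-0 value is taken to be zero everywhere, so no value table ever bootstraps from itself.

```python
import numpy as np


def fixed_horizon_td(transitions, n_states, max_horizon,
                     alpha=0.1, gamma=1.0, sweeps=500):
    """Tabular fixed-horizon TD(0) prediction (illustrative sketch).

    transitions: list of (state, reward, next_state, done) tuples
                 (a hypothetical format chosen for this example).
    Returns V with shape (max_horizon + 1, n_states), where V[h, s]
    estimates the sum of rewards over the next h steps from state s.
    """
    # V[0] stays at zero: with no steps left, no reward can be collected.
    V = np.zeros((max_horizon + 1, n_states))
    for _ in range(sweeps):
        for s, r, s_next, done in transitions:
            for h in range(1, max_horizon + 1):
                # Horizon h bootstraps from horizon h - 1, never from itself.
                bootstrap = 0.0 if done else V[h - 1, s_next]
                V[h, s] += alpha * (r + gamma * bootstrap - V[h, s])
    return V


# Hypothetical deterministic chain 0 -> 1 -> 2 -> terminal, reward 1 per step.
chain = [(0, 1.0, 1, False), (1, 1.0, 2, False), (2, 1.0, 2, True)]
V = fixed_horizon_td(chain, n_states=3, max_horizon=3)
print(np.round(V, 3))
```

In this toy chain, state 0 can collect one reward per step for three steps, so its estimates grow with the horizon, while state 2 terminates after a single reward regardless of horizon. Storing one table per horizon is the extra memory cost the abstract mentions.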


Solving Continual Combinatorial Selection via Deep Reinforcement Learning

arXiv.org Artificial Intelligence

We consider the Markov Decision Process (MDP) of selecting a subset of items at each step, termed the Select-MDP (S-MDP). The large state and action spaces of S-MDPs make them intractable to solve with typical reinforcement learning (RL) algorithms, especially when the number of items is huge. In this paper, we present a deep RL algorithm to solve this issue by adopting the following key ideas. First, we convert the original S-MDP into an Iterative Select-MDP (IS-MDP), which is equivalent to the S-MDP in terms of optimal actions. The IS-MDP decomposes a joint action of selecting K items simultaneously into K iterative selections, reducing the number of actions at the expense of an exponential increase in the number of states. Second, we overcome this state space explosion by exploiting a special symmetry in IS-MDPs with novel weight-shared Q-networks, which provably maintain sufficient expressive power. Various experiments demonstrate that our approach works well even when the item space is large and that it scales to environments with item spaces different from those used in training.
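The trade-off between fewer actions and more states can be made concrete with a little counting. The sketch below assumes a simplified setting that the abstract does not spell out: a plain choice of K distinct items out of N, with order irrelevant and nothing else attached to a selection. The function names are hypothetical, and the paper's actual S-MDP may carry more structure than this.

```python
from math import comb


def joint_action_count(n_items, k):
    """Number of ways to pick k distinct items at once (order ignored)."""
    return comb(n_items, k)


def iterative_action_counts(n_items, k):
    """Candidates available at each of the k one-item-at-a-time steps."""
    return [n_items - step for step in range(k)]


def partial_selection_count(n_items, k):
    """Distinct partial selections (0 to k - 1 items already chosen) that an
    intermediate state may need to record, per original state."""
    return sum(comb(n_items, step) for step in range(k))


n_items, k = 50, 5
print(joint_action_count(n_items, k))       # 2118760 joint actions
print(iterative_action_counts(n_items, k))  # [50, 49, 48, 47, 46]
print(partial_selection_count(n_items, k))
```

Under these assumptions, a single decision over millions of subsets becomes five decisions over at most 50 candidates each, while each original state fans out into many intermediate states that remember what has already been picked.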


An introduction to Reinforcement Learning

#artificialintelligence

This episode gives a general introduction to the field of Reinforcement Learning:
- High-level description of the field
- Policy gradients
- Biggest challenges (sparse rewards, reward shaping, ...)

This video forms the basis for a series on RL where I will dive much deeper into technical details of state-of-the-art methods for RL.

Links:
- "Pong from Pixels - Karpathy": http://karpathy.github.io/2016/05/31/rl/
- Concept networks for grasp & stack (Paper with heavy reward shaping): https://arxiv.org/abs/1709.06977

If you enjoy my videos, all support is super welcome!


Personalized HeartSteps: A Reinforcement Learning Algorithm for Optimizing Physical Activity

arXiv.org Artificial Intelligence

With the recent evolution of mobile health technologies, health scientists are increasingly interested in delivering interventions via notifications on mobile devices at the moments when they can most readily help the user prevent negative health outcomes and promote the adoption and maintenance of healthy behaviors. The type and timing of the mobile health interventions should ideally adapt to the user's context collected in real time, e.g., the time of day, the location, current activity and stress level. This gives rise to the concept of a just-in-time adaptive intervention (JITAI) [28]. Operationally, a JITAI includes a sequence of decision rules (i.e., a treatment policy) that takes the user's current context as input and specifies whether and what type of intervention should be provided at the moment. In practice, behavioral theory along with expert opinion and analyses of existing data is often used to design the decision rules. However, these theories are often insufficiently mature to precisely specify which particular intervention should be delivered and when, in order to ensure the interventions have the intended effects and optimize their long-term efficacy. As a result, there is much interest in how best to use data to inform the design of JITAIs [12, 39, 3, 35, 26, 41, 33, 10, 34, 42]. This paper develops a Reinforcement Learning (RL) algorithm to continuously learn, i.e., online, and optimize the treatment policy in the JITAI as the user experiences the intervention.