Goto

Collaborating Authors

 Reinforcement Learning


Deep Reinforcement Learning for Green Security Game with Online Information

AAAI Conferences

Motivated by the urgent need in green security domains such as protecting endangered wildlife from poaching and preventing illegal logging, researchers have proposed game theoretic models to optimize patrols conducted by law enforcement agencies. Despite the efforts, online information and online interactions (e.g., patrollers chasing the poachers by following their footprints) have been neglected in previous game models and solutions. Our research aims at providing a more practical solution for the complex real-world green security problems by empowering security games with deep reinforcement learning. Specifically, we propose a novel game model which incorporates the vital element of online information and provide a discussion of possible solutions as well as promising future research directions based on game theory and deep reinforcement learning.


Consequentialist Conditional Cooperation in Social Dilemmas with Imperfect Information (Short Workshop Version)

AAAI Conferences

Social dilemmas, where mutual cooperation can lead to high payoffs but participants face incentives to cheat, are ubiquitous in multi-agent interaction. We wish to construct agents that cooperate with pure cooperators, avoid exploitation by pure defectors, and incentivize cooperation from the rest. However, often the actions taken by a partner are (partially) unobserved or the consequences of individual actions are hard to predict. We show that in a large class of games good strategies can be constructed by conditioning one's behavior solely on outcomes (i.e. one's past rewards). We call this consequentialist conditional cooperation. We show how to construct such strategies using deep reinforcement learning techniques and demonstrate, both analytically and experimentally, that they are effective in social dilemmas beyond simple matrix games. We also show the limitations of relying purely on consequences and discuss the need for understanding both the consequences of and the intentions behind an action.


MEETING BOT: Reinforcement Learning for Dialogue Based Meeting Scheduling

AAAI Conferences

In this paper we present Meeting Bot, a reinforcement learning based conversational system that interacts with multiple users to schedule meetings. The system is able to interpret user utterences and map them to preferred time slots, which are then fed to a reinforcement learning (RL) system with the goal of converging on an agreeable time slot. The RL system is able to adapt to user preferences and environmental changes in meeting arrival rate while still scheduling effectively. Learning is performed via policy gradient with exploration, by utilizing an MLP as an approximator of the policy function. Results demonstrate that the system outperforms standard scheduling algorithms in terms of overall scheduling efficiency. Additionally, the system is able to adapt its strategy to situations when users consistently reject or accept meetings in certain slots (such as Friday afternoon versus Thursday morning), or when the meeting is called by members who are at a more senior designation.


Inverse Reinforcement Learning Based Human Behavior Modeling for Goal Recognition in Dynamic Local Network Interdiction

AAAI Conferences

Goal recognition is the task of inferring an agent's goals given some or all of the agentโ€™s observed actions. Among different ways of problem formulation, goal recognition can be solved as a model-based planning problem using off-the-shell planners. However, obtaining accurate cost or reward models of an agent and incorporating them into the planning model becomes an issue in real applications. Towards this end, we propose an Inverse Reinforcement Learning (IRL)-based opponent behavior modeling method, and apply it in the goal recognition assisted Dynamic Local Network Interdiction (DLNI) problem. We first introduce the overall framework and the DLNI problem domain of our work. After that, an IRL-based human behavior modeling method and Markov Decision Process-based goal recognition are introduced. Experimental results indicate that our learned behavior model has a higher tracking accuracy and yields better interdiction outcomes than other models.


Concept-Aware Feature Extraction for Knowledge Transfer in Reinforcement Learning

AAAI Conferences

We introduce a novel mechanism for knowledge transfer via concept formation to augment reinforcement learning agents operating in complex, uncertain domains. Based on their observations, agents form concepts and associate them with actions to generalize their decisions at higher levels of abstraction. Concepts serve as simple, portable, efficient packets of hierarchical information that can be learned in parallel. The use of conceptual knowledge simultaneously provides an interpretable, semantic explanation of an agent's decisions, making the techniques promising for human-interaction domains such as games, where human observers wish to inspect an agent's rationale. This technique extends previous work on probabilistic learning with Markov decision processes (MDPs) by introducing rich hierarchical feature structures that can be learned from experience, enabling more effective learning transfer to new, related tasks.


Autonomous Model Management via Reinforcement Learning

AAAI Conferences

Concept drift - a change, either sudden or gradual, in the underlying properties of data - is one of the most prevalent challenges to maintaining high-performing learned models over time in autonomous systems. In the face of concept drift, one can hope that the old model is sufficiently representative of the new data despite the concept drift, one can discard the old data and retrain a new model with (often limited) new data, or one can use transfer learning methods to combine the old data with the new to create an updated model. Which of these three options is chosen affects not only near-term decisions, but also future needs to transfer or retrain. In this paper, we thus model response to concept drift as a sequential decision making problem and formally frame it as a Markov Decision Process. Our reinforcement learning approach to the problem shows promising results on one synthetic and two real-world datasets.


Evaluating the Stability of Non-Adaptive Trading in Continuous Double Auctions: A Reinforcement Learning Approach

AAAI Conferences

The continuous double auction (CDA) is the predominant mechanism in modern securities markets. Despite much prior study of CDA strategies, fundamental questions about the CDA remain open, such as: (1) to what extent can outcomes in a CDA be accurately modeled by optimizing agent actions over only a simple, non-adaptive policy class; and (2) when and how can a policy that conditions its actions on market state deviate beneficially from an optimally parameterized, but simpler, policy like Zero Intelligence (ZI). To investigate these questions, we present an experimental comparison of the strategic stability of policies found by reinforcement learning (RL) over a massive space, or through empirical Nash-equilibrium solving over a smaller space of non-adaptive, ZI policies. Our findings indicate that in a plausible market environment, an adaptive trading policy can deviate beneficially from an equilibrium of ZI traders, by conditioning on signals of the likelihood a trade will execute or the favorability of the current bid and ask. Nevertheless, the surplus earned by well-calibrated ZI policies is empirically observed to be nearly as great as what a deviating reinforcement learner could earn, using a much larger policy space. This finding supports the idea that it is reasonable to use equilibrated ZI traders in studies of CDA market outcomes.


Programmatically Interpretable Reinforcement Learning

arXiv.org Machine Learning

We study the problem of generating interpretable and verifiable policies through reinforcement learning. Unlike the popular Deep Reinforcement Learning (DRL) paradigm, in which the policy is represented by a neural network, the aim in Programmatically Interpretable Reinforcement Learning is to find a policy that can be represented in a high-level programming language. Such programmatic policies have the benefits of being more easily interpreted than neural networks, and being amenable to verification by symbolic methods. We propose a new method, called Neurally Directed Program Search (NDPS), for solving the challenging nonsmooth optimization problem of finding a programmatic policy with maxima reward. NDPS works by first learning a neural policy network using DRL, and then performing a local search over programmatic policies that seeks to minimize a distance from this neural "oracle". We evaluate NDPS on the task of learning to drive a simulated car in the TORCS car-racing environment. We demonstrate that NDPS is able to discover human-readable policies that pass some significant performance bars. We also find that a well-designed policy language can serve as a regularizer, and result in the discovery of policies that lead to smoother trajectories and are more easily transferred to environments not encountered during training.


A Human Mixed Strategy Approach to Deep Reinforcement Learning

arXiv.org Machine Learning

In 2015, Google's DeepMind announced an advancement in creating an autonomous agent based on deep reinforcement learning (DRL) that could beat a professional player in a series of 49 Atari games. However, the current manifestation of DRL is still immature, and has significant drawbacks. One of DRL's imperfections is its lack of "exploration" during the training process, especially when working with high-dimensional problems. In this paper, we propose a mixed strategy approach that mimics behaviors of human when interacting with environment, and create a "thinking" agent that allows for more efficient exploration in the DRL training process. The simulation results based on the Breakout game show that our scheme achieves a higher probability of obtaining a maximum score than does the baseline DRL algorithm, i.e., the asynchronous advantage actor-critic method. The proposed scheme therefore can be applied effectively to solving a complicated task in a real-world application.


Introduction to Various Reinforcement Learning Algorithms. Part II (TRPO, PPO)

#artificialintelligence

Advantage is a term that is commonly used in numerous advanced RL algorithms, such as A3C, NAF, and the algorithms that I am going to discuss (perhaps I will write another blog post for these two algorithms). To view it in a more intuitive manner, think of it as how good an action is compared to the average action for a specific state. But why do we need advantage? I will use an example posted in this forum to illustrate the idea of advantage. Have you ever played a game called "Catch"? In the game, fruits will be dropping down from the top of the screen.