Reinforcement Learning
XDO: A Double Oracle Algorithm for Extensive-Form Games
McAleer, Stephen, Lanier, John, Baldi, Pierre, Fox, Roy
Policy Space Response Oracles (PSRO) is a deep reinforcement learning algorithm for two-player zero-sum games that has empirically found approximate Nash equilibria in large games. Although PSRO is guaranteed to converge to a Nash equilibrium, it may take an exponential number of iterations as the number of infostates grows. We propose Extensive-Form Double Oracle (XDO), an extensive-form double oracle algorithm that is guaranteed to converge to an approximate Nash equilibrium linearly in the number of infostates. Unlike PSRO, which mixes best responses at the root of the game, XDO mixes best responses at every infostate. We also introduce Neural XDO (NXDO), where the best response is learned through deep RL. In tabular experiments on Leduc poker, we find that XDO achieves an approximate Nash equilibrium in a number of iterations 1-2 orders of magnitude smaller than PSRO. In experiments on a modified Leduc poker game, we show that tabular XDO achieves over 11x lower exploitability than CFR and over 82x lower exploitability than PSRO and XFP in the same amount of time. We also show that NXDO beats PSRO and is competitive with NFSP on a large no-limit poker game.
A Vision Based Deep Reinforcement Learning Algorithm for UAV Obstacle Avoidance
Roghair, Jeremy, Ko, Kyungtae, Asli, Amir Ehsan Niaraki, Jannesari, Ali
Integration of reinforcement learning with unmanned aerial vehicles (UAVs) to achieve autonomous flight has been an active research area in recent years. An important part focuses on obstacle detection and avoidance for UAVs navigating through an environment. Exploration in an unseen environment can be tackled with Deep Q-Network (DQN). However, value exploration with uniform sampling of actions may lead to redundant states, where often the environments inherently bear sparse rewards. To resolve this, we present two techniques for improving exploration for UAV obstacle avoidance. The first is a convergence-based approach that uses convergence error to iterate through unexplored actions and temporal threshold to balance exploration and exploitation. The second is a guidance-based approach using a Domain Network which uses a Gaussian mixture distribution to compare previously seen states to a predicted next state in order to select the next action. Performance and evaluation of these approaches were implemented in multiple 3-D simulation environments, with variation in complexity. The proposed approach demonstrates a two-fold improvement in average rewards compared to state of the art.
Improving Context-Based Meta-Reinforcement Learning with Self-Supervised Trajectory Contrastive Learning
Wang, Bernie, Xu, Simon, Keutzer, Kurt, Gao, Yang, Wu, Bichen
Meta-reinforcement learning typically requires orders of magnitude more samples than single task reinforcement learning methods. This is because meta-training needs to deal with more diverse distributions and train extra components such as context encoders. To address this, we propose a novel self-supervised learning task, which we named Trajectory Contrastive Learning (TCL), to improve meta-training. TCL adopts contrastive learning and trains a context encoder to predict whether two transition windows are sampled from the same trajectory. TCL leverages the natural hierarchical structure of context-based meta-RL and makes minimal assumptions, allowing it to be generally applicable to context-based meta-RL algorithms. It accelerates the training of context encoders and improves meta-training overall. Experiments show that TCL performs better or comparably than a strong meta-RL baseline in most of the environments on both meta-RL MuJoCo (5 of 6) and Meta-World benchmarks (44 out of 50).
Hard Attention Control By Mutual Information Maximization
Sahni, Himanshu, Isbell, Charles
Biological agents have adopted the principle of attention to limit the rate of incoming information from the environment. One question that arises is if an artificial agent has access to only a limited view of its surroundings, how can it control its attention to effectively solve tasks? We propose an approach for learning how to control a hard attention window by maximizing the mutual information between the environment state and the attention location at each step. The agent employs an internal world model to make predictions about its state and focuses attention towards where the predictions may be wrong. Attention is trained jointly with a dynamic memory architecture that stores partial observations and keeps track of the unobserved state. We demonstrate that our approach is effective in predicting the full state from a sequence of partial observations. We also show that the agent's internal representation of the surroundings, a live mental map, can be used for control in two partially observable reinforcement learning tasks. Reinforcement learning (RL) algorithms have successfully employed neural networks over the past few years, surpassing human level performance in many tasks (Mnih et al., 2015; Silver et al., 2017; Berner et al., 2019; Schulman et al., 2017). But a key difference in the way tasks are performed by humans versus RL algorithms is that humans have the ability to focus on parts of the state at a time, using attention to limit the amount of information gathered at every step. We actively control our attention to build an internal representation of our surroundings over multiple fixations (Fourtassi et al., 2017; Barrouillet et al., 2004; Yarbus, 2013; Itti, 2005). We also use memory and internal world models to predict motions of dynamic objects in the scene when they are not under direct observation (Bosco et al., 2012). By limiting the amount of input information in these two ways, i.e. directing attention only where needed and internally modeling the rest of the environment, we are able to be more efficient in terms of data that needs to be collected from the environment and processed at each time step. By contrast, modern reinforcement learning methods often operate on the entire state.
Causal-aware Safe Policy Improvement for Task-oriented dialogue
Ramachandran, Govardana Sachithanandam, Hashimoto, Kazuma, Xiong, Caiming
The recent success of reinforcement learning's (RL) in solving complex tasks is most often attributed to its capacity to explore and exploit an environment where it has been trained. Sample efficiency is usually not an issue since cheap simulators are available to sample data on-policy. On the other hand, task oriented dialogues are usually learnt from offline data collected using human demonstrations. Collecting diverse demonstrations and annotating them is expensive. Unfortunately, use of RL methods trained on off-policy data are prone to issues of bias and generalization, which are further exacerbated by stochasticity in human response and non-markovian belief state of a dialogue management system. To this end, we propose a batch RL framework for task oriented dialogue policy learning: causal aware safe policy improvement (CASPI). This method gives guarantees on dialogue policy's performance and also learns to shape rewards according to intentions behind human responses, rather than just mimicking demonstration data; this couple with batch-RL helps overall with sample efficiency of the framework. We demonstrate the effectiveness of this framework on a dialogue-context-to-text Generation and end-to-end dialogue task of the Multiwoz2.0 dataset. The proposed method outperforms the current state of the art on these metrics, in both case. In the end-to-end case, our method trained only on 10\% of the data was able to out perform current state in three out of four evaluation metrics.
Using Cognitive Models to Train Warm Start Reinforcement Learning Agents for Human-Computer Interactions
Zhang, Chao, Wang, Shihan, Aarts, Henk, Dastani, Mehdi
Reinforcement learning (RL) has gained growing popularity in many human-computer interaction (HCI) applications [1, 2, 3]. In digital health interventions, for example, RL is a natural choice for personalization as RL agents can continuous adapt their strategies based on users' responses to the interventions [3]. Moreover, the recent advances in interactive RL calls for contributions from HCI researchers to improve the efficiency of RL algorithms [4]. While there is a natural fit between RL and HCI, the well-known data greedy property of reinforcement learning makes the RL-based systems often suffer from the cold start problem [5]. In HCI, as very few (or even no) experiences with users are available at the beginning in general, RL agents are required to interact many times with users prior to performing well. Many researchers had made efforts to overcome this challenge by shortening the learning process. Several approaches have been proposed to perform a faster online learning so that less interactions are needed in practice. For instance, Tabatabaei et al. [6] and Tomkins et al. [7] make RL algorithms quickly learn from the limited experience This is a preprint of our position paper presented to the "Reinforcement Learning for Humans, Computer, and Interaction (RL4HCI)" workship at ACM CHI2021, https://sites.google.com/view/rl4hci/home. The preprint is published under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.
I am Robot: Neuromuscular Reinforcement Learning to Actuate Human Limbs through Functional Electrical Stimulation
Wannawas, Nat, Shafti, Ali, Faisal, A. Aldo
Functional Electrical Stimulation (FES) is an established and safe technique for contracting muscles by stimulating the skin above a muscle to induce its contraction. However, an open challenge remains on how to restore motor abilities to human limbs through FES, as the problem of controlling the stimulation is unclear. We are taking a robotics perspective on this problem, by developing robot learning algorithms that control the ultimate humanoid robot, the human body, through electrical muscle stimulation. Human muscles are not trivial to control as actuators due to their force production being non-stationary as a result of fatigue and other internal state changes, in contrast to robot actuators which are wellunderstood and stationary over broad operation ranges. We present our Deep Reinforcement Learning approach to the control of human muscles with FES, using a recurrent neural network for dynamic state representation, to overcome the unobserved elements of the behaviour of human muscles under external stimulation. We demonstrate our technique both in neuromuscular simulations but also experimentally on a human. Our results show that our controller can learn to manipulate human muscles, applying appropriate levels of stimulation to achieve the given tasks while compensating for advancing muscle fatigue which arises throughout the tasks. Additionally, Figure 1: Our 3 scenarios for FES control: (a) arm vertical motion our technique can learn quickly enough to be implemented in in simulation (b) and human volunteers, (c) arm horizontal motion real-world human-in-the-loop settings.
Non-asymptotic Confidence Intervals of Off-policy Evaluation: Primal and Dual Bounds
Feng, Yihao, Tang, Ziyang, Zhang, Na, Liu, Qiang
Off-policy evaluation (OPE) is the task of estimating the expected reward of a given policy based on offline data previously collected under different policies. Therefore, OPE is a key step in applying reinforcement learning to real-world domains such as medical treatment, where interactive data collection is expensive or even unsafe. As the observed data tends to be noisy and limited, it is essential to provide rigorous uncertainty quantification, not just a point estimation, when applying OPE to make high stakes decisions. This work considers the problem of constructing non-asymptotic confidence intervals in infinite-horizon off-policy evaluation, which remains a challenging open question. We develop a practical algorithm through a primal-dual optimization-based approach, which leverages the kernel Bellman loss (KBL) of Feng et al.(2019) and a new martingale concentration inequality of KBL applicable to time-dependent data with unknown mixing conditions. Our algorithm makes minimum assumptions on the data and the function class of the Q-function, and works for the behavior-agnostic settings where the data is collected under a mix of arbitrary unknown behavior policies. We present empirical results that clearly demonstrate the advantages of our approach over existing methods.
Whittle index based Q-learning for restless bandits with average reward
Avrachenkov, Konstantin, Borkar, Vivek S.
A novel reinforcement learning algorithm is introduced for multiarmed restless bandits with average reward, using the paradigms of Q-learning and Whittle index. Specifically, we leverage the structure of the Whittle index policy to reduce the search space of Q-learning, resulting in major computational gains. Rigorous convergence analysis is provided, supported by numerical experiments. The numerical experiments show excellent empirical performance of the proposed scheme.
A Two-stage Framework and Reinforcement Learning-based Optimization Algorithms for Complex Scheduling Problems
He, Yongming, Wu, Guohua, Chen, Yingwu, Pedrycz, Witold
There hardly exists a general solver that is efficient for scheduling problems due to their diversity and complexity. In this study, we develop a two-stage framework, in which reinforcement learning (RL) and traditional operations research (OR) algorithms are combined together to efficiently deal with complex scheduling problems. The scheduling problem is solved in two stages, including a finite Markov decision process (MDP) and a mixed-integer programming process, respectively. This offers a novel and general paradigm that combines RL with OR approaches to solving scheduling problems, which leverages the respective strengths of RL and OR: The MDP narrows down the search space of the original problem through an RL method, while the mixed-integer programming process is settled by an OR algorithm. These two stages are performed iteratively and interactively until the termination criterion has been met. Under this idea, two implementation versions of the combination methods of RL and OR are put forward. The agile Earth observation satellite scheduling problem is selected as an example to demonstrate the effectiveness of the proposed scheduling framework and methods. The convergence and generalization capability of the methods are verified by the performance of training scenarios, while the efficiency and accuracy are tested in 50 untrained scenarios. The results show that the proposed algorithms could stably and efficiently obtain satisfactory scheduling schemes for agile Earth observation satellite scheduling problems. In addition, it can be found that RL-based optimization algorithms have stronger scalability than non-learning algorithms. This work reveals the advantage of combining reinforcement learning methods with heuristic methods or mathematical programming methods for solving complex combinatorial optimization problems.