Reinforcement Learning
Near Optimal Provable Uniform Convergence in Off-Policy Evaluation for Reinforcement Learning
Yin, Ming, Bai, Yu, Wang, Yu-Xiang
Reinforcement learning (RL), is the problem described by an agent interacting with an environment in order to maximize its cumulative rewards through time Sutton & Barto (2018). Among the great landscapes of reinforcement learning, off-policy evaluation (OPE) refers to the problem of predicting the performance of a policy with data only collected by a logging/behavioral policy, and is of crucial importance to real-world applications of RL including marketing Thomas et al. (2017), targeted advertising Bottou et al. (2013); Tang et al.(2013), finance Bertoluzzo & Corazza(2012), robotics Quillen et al.(2018), and healthcare Ernst et al. (2006); Raghu et al. (2017, 2018). A central challenge in OPE is the distributional mismatch between the behavioral policy and the target policy, which has been tackled in previous studies using Importance Sampling (IS) based methods Li et al. (2011); Dudík et al. (2011); Li et al. (2015); Thomas & Brunskill (2016) or its hybridversions such as doubly robust estimators Jiang & Li (2016); Farajtabar et al. (2018). More recently, a family of estimators based on marginalized importance sampling (MIS) Liu et al. (2018); Xie et al. (2019); Kallus & Uehara (2019a,b); Yin & Wang (2020) have been proposed in order to overcome the "curse of horizon", which refers to the phenomenon of OPE problem that any unbiased estimator has to suffer the variance which is exponential in horizon for some MDP class Jiang & Li (2016); Liu et al. (2018). However, previous works only consider the OPE problem of a fixed (non data-dependent) target policy π, whereas in practice it is common that we need to evaluate the performance of a data-dependent one.
Deep Reinforcement Learning and its Neuroscientific Implications
Botvinick, Matthew, Wang, Jane X., Dabney, Will, Miller, Kevin J., Kurth-Nelson, Zeb
The emergence of powerful artificial intelligence is defining new research directions in neuroscience. To date, this research has focused largely on deep neural networks trained using supervised learning, in tasks such as image classification. However, there is another area of recent AI work which has so far received less attention from neuroscientists, but which may have profound neuroscientific implications: deep reinforcement learning. Deep RL offers a comprehensive framework for studying the interplay among learning, representation and decision-making, offering to the brain sciences a new set of research tools and a wide range of novel hypotheses. In the present review, we provide a high-level introduction to deep RL, discuss some of its initial applications to neuroscience, and survey its wider implications for research on brain and behavior, concluding with a list of opportunities for next-stage research.
Provably Safe PAC-MDP Exploration Using Analogies
Roderick, Melrose, Nagarajan, Vaishnavh, Kolter, J. Zico
A key challenge in applying reinforcement learning to safety-critical domains is understanding how to balance exploration (needed to attain good performance on the task) with safety (needed to avoid catastrophic failure). Although a growing line of work in reinforcement learning has investigated this area of "safe exploration," most existing techniques either 1) do not guarantee safety during the actual exploration process; and/or 2) limit the problem to a priori known and/or deterministic transition dynamics with strong smoothness assumptions. Addressing this gap, we propose Analogous Safe-state Exploration (ASE), an algorithm for provably safe exploration in MDPs with unknown, stochastic dynamics. Our method exploits analogies between state-action pairs to safely learn a near-optimal policy in a PAC-MDP sense. Additionally, ASE also guides exploration towards the most task-relevant states, which empirically results in significant improvements in terms of sample efficiency, when compared to existing methods. Source code for the experiments is available at https://github.com/locuslab/ase.
Sharp Analysis of Smoothed Bellman Error Embedding
Touati, Ahmed, Vincent, Pascal
The \textit{Smoothed Bellman Error Embedding} algorithm~\citep{dai2018sbeed}, known as SBEED, was proposed as a provably convergent reinforcement learning algorithm with general nonlinear function approximation. It has been successfully implemented with neural networks and achieved strong empirical results. In this work, we study the theoretical behavior of SBEED in batch-mode reinforcement learning. We prove a near-optimal performance guarantee that depends on the representation power of the used function classes and a tight notion of the distribution shift. Our results improve upon prior guarantees for SBEED in ~\citet{dai2018sbeed} in terms of the dependence on the planning horizon and on the sample size. Our analysis builds on the recent work of ~\citet{Xie2020} which studies a related algorithm MSBO, that could be interpreted as a \textit{non-smooth} counterpart of SBEED.
GraphOpt: Learning Optimization Models of Graph Formation
Trivedi, Rakshit, Yang, Jiachen, Zha, Hongyuan
Formation mechanisms are fundamental to the study of complex networks, but learning them from observations is challenging. In real-world domains, one often has access only to the final constructed graph, instead of the full construction process, and observed graphs exhibit complex structural properties. In this work, we propose GraphOpt, an end-to-end framework that jointly learns an implicit model of graph structure formation and discovers an underlying optimization mechanism in the form of a latent objective function. The learned objective can serve as an explanation for the observed graph properties, thereby lending itself to transfer across different graphs within a domain. GraphOpt poses link formation in graphs as a sequential decision-making process and solves it using maximum entropy inverse reinforcement learning algorithm. Further, it employs a novel continuous latent action space that aids scalability. Empirically, we demonstrate that GraphOpt discovers a latent objective transferable across graphs with different characteristics. GraphOpt also learns a robust stochastic policy that achieves competitive link prediction performance without being explicitly trained on this task and further enables construction of graphs with properties similar to those of the observed graph.
Explaining Fast Improvement in Online Policy Optimization
Yan, Xinyan, Boots, Byron, Cheng, Ching-An
Online policy optimization (OPO) views policy optimization for sequential decision making as an online learning problem. In this framework, the algorithm designer defines a sequence of online loss functions such that the regret rate in online learning implies the policy convergence rate and the minimal loss witnessed by the policy class determines the policy performance bias. This reduction technique has been successfully applied to solving various policy optimization problems, including imitation learning, structured prediction, and system identification. Interestingly, the policy improvement speed observed in practice is usually much faster than existing theory suggests. In this work, we provide an explanation of this fast policy improvement phenomenon. Let $\epsilon$ denote the policy class bias and assume the online loss functions are convex, smooth, and non-negative. We prove that, after $N$ rounds of OPO with stochastic feedback, the policy converges in $\tilde{O}(1/N + \sqrt{\epsilon/N})$ in both expectation and high probability. In other words, we show that adopting a sufficiently expressive policy class in OPO has two benefits: both the convergence rate increases and the performance bias decreases, as the policy class becomes reasonably rich. This new theoretical insight is further verified in an online imitation learning experiment.
Guided Exploration with Proximal Policy Optimization using a Single Demonstration
Libardi, Gabriele, De Fabritiis, Gianni
Solving sparse reward tasks through exploration is one of the major challenges in deep reinforcement learning, especially in three-dimensional, partially-observable environments. Critically, the algorithm proposed in this article uses a single human demonstration to solve hard-exploration problems. We train an agent on a combination of demonstrations and own experience to solve problems with variable initial conditions. We adapt this idea and integrate it with the proximal policy optimization (PPO). The agent is able to increase its performance and to tackle harder problems by replaying its own past trajectories prioritizing them based on the obtained reward and the maximum value of the trajectory. We compare different variations of this algorithm to behavioral cloning on a set of hard-exploration tasks in the Animal-AI Olympics environment. To the best of our knowledge, learning a task in a three-dimensional environment with comparable difficulty has never been considered before using only one human demonstration.
Predictive Maintenance for Edge-Based Sensor Networks: A Deep Reinforcement Learning Approach
Ong, Kevin Shen Hoong, Niyato, Dusit, Yuen, Chau
Failure of mission-critical equipment interrupts production and results in monetary loss. The risk of unplanned equipment downtime can be minimized through Predictive Maintenance of revenue generating assets to ensure optimal performance and safe operation of equipment. However, the increased sensorization of the equipment generates a data deluge, and existing machine-learning based predictive model alone becomes inadequate for timely equipment condition predictions. In this paper, a model-free Deep Reinforcement Learning algorithm is proposed for predictive equipment maintenance from an equipment-based sensor network context. Within each equipment, a sensor device aggregates raw sensor data, and the equipment health status is analyzed for anomalous events. Unlike traditional black-box regression models, the proposed algorithm self-learns an optimal maintenance policy and provides actionable recommendation for each equipment. Our experimental results demonstrate the potential for broader range of equipment maintenance applications as an automatic learning framework.
Counterfactual Data Augmentation using Locally Factored Dynamics
Pitis, Silviu, Creager, Elliot, Garg, Animesh
Many dynamic processes, including common scenarios in robotic control and reinforcement learning (RL), involve a set of interacting subprocesses. Though the subprocesses are not independent, their interactions are often sparse, and the dynamics at any given time step can often be decomposed into locally independent causal mechanisms. Such local causal structures can be leveraged to improve the sample efficiency of sequence prediction and off-policy reinforcement learning. We formalize this by introducing local causal models (LCMs), which are induced from a global causal model by conditioning on a subset of the state space. We propose an approach to inferring these structures given an object-oriented state representation, as well as a novel algorithm for model-free Counterfactual Data Augmentation (CoDA). CoDA uses local structures and an experience replay to generate counterfactual experiences that are causally valid in the global model. We find that CoDA significantly improves the performance of RL agents in locally factored tasks, including the batch-constrained and goal-conditioned settings.
Jump Operator Planning: Goal-Conditioned Policy Ensembles and Zero-Shot Transfer
Ringstrom, Thomas J., Hasanbeig, Mohammadhosein, Abate, Alessandro
In Hierarchical Control, compositionality, abstraction, and task-transfer are crucial for designing versatile algorithms which can solve a variety of problems with maximal representational reuse. We propose a novel hierarchical and compositional framework called Jump-Operator Dynamic Programming for quickly computing solutions within a super-exponential space of sequential sub-goal tasks with ordering constraints, while also providing a fast linearly-solvable algorithm as an implementation. This approach involves controlling over an ensemble of reusable goal-conditioned polices functioning as temporally extended actions, and utilizes transition operators called feasibility functions, which are used to summarize initial-to-final state dynamics of the polices. Consequently, the added complexity of grounding a high-level task space onto a larger ambient state-space can be mitigated by optimizing in a lower-dimensional subspace defined by the grounding, substantially improving the scalability of the algorithm while effecting transferable solutions. We then identify classes of objective functions on this subspace whose solutions are invariant to the grounding, resulting in optimal zero-shot transfer.