Reinforcement Learning
Adaptive Covariate Acquisition for Minimizing Total Cost of Classification
Andrade, Daniel, Okajima, Yuzuru
In some applications, acquiring covariates comes at a non-negligible cost. For example, in the medical domain, classifying whether a patient has diabetes may require measuring glucose tolerance, which can be expensive. Assuming that the user can specify the cost of each covariate and the cost of misclassification, our goal is to minimize the (expected) total cost of classification, i.e., the cost of misclassification plus the cost of the acquired covariates. We formalize this optimization goal using the (conditional) Bayes risk and describe the optimal solution using a recursive procedure. Since the procedure is computationally infeasible, we introduce two assumptions: (1) the optimal classifier can be represented by a generalized additive model, and (2) the optimal sets of covariates are limited to a sequence of sets of increasing size. We show that under these two assumptions, a computationally efficient solution exists. Furthermore, on several medical datasets, we show that the proposed method achieves the lowest total cost in most situations when compared with various previous methods. Finally, we weaken the requirement that the user specify all misclassification costs by allowing the user to specify instead the minimally acceptable recall (target recall). Our experiments confirm that the proposed method achieves the target recall while minimizing the false discovery rate and the covariate acquisition costs better than previous methods.
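The total-cost objective described above can be illustrated with a minimal sketch. It assumes a posterior over labels is already available for the covariates acquired so far; the function names, the cost matrix, and all numbers are hypothetical and are not taken from the paper.

```python
def conditional_bayes_risk(posterior, cost):
    """Return (best_label, risk) for a given posterior over labels.

    posterior[y] is the probability of true label y given the acquired
    covariates; cost[y][y_hat] is the user-specified cost of predicting
    y_hat when the truth is y (hypothetical values in this sketch).
    """
    n_labels = len(posterior)
    risks = [
        sum(posterior[y] * cost[y][y_hat] for y in range(n_labels))
        for y_hat in range(n_labels)
    ]
    best = min(range(n_labels), key=lambda y_hat: risks[y_hat])
    return best, risks[best]


def expected_total_cost(posterior, cost, acquired_costs):
    """Expected misclassification cost plus the cost of acquired covariates."""
    _, risk = conditional_bayes_risk(posterior, cost)
    return sum(acquired_costs) + risk


# Hypothetical example: a missed positive (true 1, predicted 0) costs 100,
# a false alarm costs 10, and one covariate costing 5 has been acquired.
posterior = [0.8, 0.2]
cost = [[0.0, 10.0], [100.0, 0.0]]
label, risk = conditional_bayes_risk(posterior, cost)
total = expected_total_cost(posterior, cost, acquired_costs=[5.0])
```

In this toy setting the cheaper decision is to predict the positive class even though it is less probable, because a missed positive is assumed to be ten times as costly as a false alarm.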
Disentangling Controllable Object through Video Prediction Improves Visual Reinforcement Learning
Zhong, Yuanyi, Schwing, Alexander, Peng, Jian
In many vision-based reinforcement learning (RL) problems, the agent controls a movable object in its visual field, e.g., the player's avatar in video games and the robotic arm in visual grasping and manipulation. Leveraging action-conditioned video prediction, we propose an end-to-end learning framework to disentangle the controllable object from the observation signal. The disentangled representation is shown to be useful for RL as additional observation channels to the agent. Experiments on a set of Atari games with the popular Double DQN algorithm demonstrate improved sample efficiency and game performance (from 222.8% to 261.4% measured in normalized game scores, with prediction bonus reward).
Discrete Action On-Policy Learning with Action-Value Critic
Yue, Yuguang, Tang, Yunhao, Yin, Mingzhang, Zhou, Mingyuan
Reinforcement learning (RL) in discrete action space is ubiquitous in real-world applications, but its complexity grows exponentially with the action-space dimension, making it challenging to apply existing on-policy gradient-based deep RL algorithms efficiently. To effectively operate in multidimensional discrete action spaces, we construct a critic to estimate action-value functions, apply it on correlated actions, and combine these critic-estimated action values to control the variance of gradient estimation. We follow rigorous statistical analysis to design how to generate and combine these correlated actions, and how to sparsify the gradients by shutting down the contributions from certain dimensions. These efforts result in a new discrete action on-policy RL algorithm that empirically outperforms related on-policy algorithms relying on variance control techniques. We demonstrate these properties on OpenAI Gym benchmark tasks, and illustrate how discretizing the action space could benefit the exploration phase and hence facilitate convergence to a better locally optimal solution thanks to the flexibility of discrete policy.
Language as a Cognitive Tool to Imagine Goals in Curiosity-Driven Exploration
Colas, Cédric, Karch, Tristan, Lair, Nicolas, Dussoux, Jean-Michel, Moulin-Frier, Clément, Dominey, Peter Ford, Oudeyer, Pierre-Yves
Autonomous reinforcement learning agents must be intrinsically motivated to explore their environment, discover potential goals, represent them and learn how to achieve them. As children do the same, they benefit from exposure to language, using it to formulate goals and imagine new ones as they learn their meaning. In our proposed learning architecture (IMAGINE), the agent freely explores its environment and turns natural language descriptions of interesting interactions from a social partner into potential goals. IMAGINE learns to represent goals by jointly learning a language model and a goal-conditioned reward function. Just like humans, our agent uses language compositionality to generate new goals by composing known ones. Leveraging modular model architectures based on Deep Sets and gated-attention mechanisms, IMAGINE autonomously builds a repertoire of behaviors and shows good zero-shot generalization properties for various types of generalization. When imagining its own goals, the agent leverages zero-shot generalization of the reward function to further train on imagined goals and refine its behavior. We present experiments in a simulated domain where the agent interacts with procedurally generated scenes containing objects of various types and colors, discovers goals, imagines others and learns to achieve them.
Estimating Q(s,s') with Deep Deterministic Dynamics Gradients
Edwards, Ashley D., Sahni, Himanshu, Liu, Rosanne, Hung, Jane, Jain, Ankit, Wang, Rui, Ecoffet, Adrien, Miconi, Thomas, Isbell, Charles, Yosinski, Jason
In this paper, we introduce a novel form of value function, $Q(s, s')$, that expresses the utility of transitioning from a state $s$ to a neighboring state $s'$ and then acting optimally thereafter. In order to derive an optimal policy, we develop a forward dynamics model that learns to make next-state predictions that maximize this value. This formulation decouples actions from values while still learning off-policy. We highlight the benefits of this approach in terms of value function transfer, learning within redundant action spaces, and learning off-policy from state observations generated by sub-optimal or completely random policies. Code and videos are available at sites.google.com/view/qss-paper.
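One way to read the definition of $Q(s, s')$ is as a Bellman-style recursion over state pairs. The sketch below assumes a small deterministic tabular world with a known neighbor graph, which is a simplification: the paper learns a forward dynamics model rather than enumerating neighbors. The chain environment, reward function, and function names are all hypothetical.

```python
def qss_value_iteration(neighbors, reward, gamma=0.9, sweeps=100):
    """Tabular sketch: Q(s, s') = r(s, s') + gamma * max over s'' of Q(s', s'').

    neighbors maps each state to the list of states reachable in one step;
    a state with no neighbors is treated as terminal (future value 0).
    """
    q = {(s, s2): 0.0 for s, nbrs in neighbors.items() for s2 in nbrs}
    for _ in range(sweeps):
        for (s, s2) in q:
            future = max((q[(s2, s3)] for s3 in neighbors[s2]), default=0.0)
            q[(s, s2)] = reward(s, s2) + gamma * future
    return q


def greedy_next_state(q, neighbors, s):
    """Pick the neighboring state with the highest Q(s, s')."""
    return max(neighbors[s], key=lambda s2: q[(s, s2)])


# Hypothetical 4-state chain 0 - 1 - 2 - 3, where state 3 is terminal
# and entering it yields reward 1.
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: []}
q = qss_value_iteration(neighbors, lambda s, s2: 1.0 if s2 == 3 else 0.0)
target_from_1 = greedy_next_state(q, neighbors, 1)
```

Note that no action appears anywhere in the table: the policy is expressed as a choice of next state, which is the decoupling of actions from values that the abstract refers to.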
Researchers Improve Robotic Arm Used in Surgery
Facebook has recently created an algorithm that enhances an AI agent's ability to navigate an environment, letting the agent determine the shortest route through new environments without access to a map. While mobile robots typically have a map programmed into them, the new algorithm could enable the creation of robots that navigate without one. According to a post by Facebook researchers, a major challenge for robot navigation is endowing AI systems with the ability to navigate through novel environments and reach programmed destinations without a map. To tackle this challenge, Facebook created a reinforcement learning algorithm distributed across multiple learners, called decentralized distributed proximal policy optimization (DD-PPO).
Risk-Aware Energy Scheduling for Edge Computing with Microgrid: A Multi-Agent Deep Reinforcement Learning Approach
Munir, Md. Shirajum, Abedin, Sarder Fakhrul, Tran, Nguyen H., Han, Zhu, Huh, Eui Nam, Hong, Choong Seon
In recent years, multi-access edge computing (MEC) has become a key enabler for handling the massive expansion of Internet of Things (IoT) applications and services. However, the energy consumption of a MEC network depends on volatile tasks that induce risk in energy demand estimation. As an energy supplier, a microgrid can facilitate seamless energy supply. However, the risk associated with energy supply also increases due to unpredictable energy generation from renewable and non-renewable sources. In particular, the risk of energy shortfall involves uncertainties in both energy consumption and generation. In this paper, we study a risk-aware energy scheduling problem for a microgrid-powered MEC network. First, we formulate an optimization problem considering the conditional value-at-risk (CVaR) measurement for both energy consumption and generation, where the objective is to minimize the loss of energy shortfall of the MEC network, and we show that this problem is NP-hard. Second, we analyze our formulated problem using a multi-agent stochastic game that ensures a joint-policy Nash equilibrium, and show the convergence of the proposed model. Third, we derive the solution by applying a multi-agent deep reinforcement learning (MADRL)-based asynchronous advantage actor-critic (A3C) algorithm with shared neural networks. This method mitigates the curse of dimensionality of the state space and chooses the best policy among the agents for the proposed problem. Finally, the experimental results establish that considering CVaR yields a significant performance gain in high-accuracy energy scheduling for the proposed model compared with both the single-agent and random-agent models.
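For readers unfamiliar with CVaR, the following is a minimal sketch of the standard empirical estimator: the average of the worst (1 - alpha) fraction of observed losses. It is not the paper's formulation; the function name and the shortfall samples are hypothetical.

```python
import math


def empirical_cvar(losses, alpha=0.95):
    """Average of the worst (1 - alpha) fraction of the sampled losses.

    With alpha = 0 this reduces to the plain mean; as alpha approaches 1
    it approaches the single worst observed loss.
    """
    if not losses:
        raise ValueError("losses must be non-empty")
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must lie in [0, 1)")
    ordered = sorted(losses, reverse=True)
    # Number of tail samples; the small epsilon guards against
    # floating-point error in (1 - alpha) * n before rounding up.
    k = max(1, math.ceil((1.0 - alpha) * len(ordered) - 1e-9))
    return sum(ordered[:k]) / k


# Hypothetical energy shortfalls (demand minus generation, in kWh)
# observed over ten scheduling intervals.
shortfalls = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
tail_risk = empirical_cvar(shortfalls, alpha=0.8)
average = empirical_cvar(shortfalls, alpha=0.0)
```

Minimizing this tail average rather than the mean penalizes schedules that are usually adequate but occasionally leave a large shortfall.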
On the Search for Feedback in Reinforcement Learning
Wang, Ran, Parunandi, Karthikeya S., Yu, Dan, Kalathil, Dileep, Chakravorty, Suman
This paper addresses the problem of learning the optimal feedback policy for a nonlinear stochastic dynamical system with continuous state space, continuous action space and unknown dynamics. Feedback policies are complex objects that typically need a high-dimensional parametrization, which makes Reinforcement Learning algorithms that search for an optimum in this large parameter space sample-inefficient and subject to high variance. We propose a "decoupling" principle that drastically reduces the feedback parameter space while remaining near-optimal to fourth order in a small noise parameter. Based on this principle, we propose a decoupled data-based control (D2C) algorithm that addresses the stochastic control problem in two steps: first, an open-loop deterministic trajectory optimization problem is solved using a black-box simulation model of the dynamical system. Then, a linear closed-loop control is developed around this nominal trajectory using only a simulation model. Empirical evidence suggests a significant reduction in training time, as well as in training variance, compared to other state-of-the-art Reinforcement Learning algorithms.
Safe Imitation Learning via Fast Bayesian Reward Inference from Preferences
Brown, Daniel S., Coleman, Russell, Srinivasan, Ravi, Niekum, Scott
Bayesian reward learning from demonstrations enables rigorous safety and uncertainty analysis when performing imitation learning. However, Bayesian reward learning methods are typically computationally intractable for complex control problems. We propose a highly efficient Bayesian reward learning algorithm that scales to high-dimensional imitation learning problems by first pre-training a low-dimensional feature encoding via self-supervised tasks and then leveraging preferences over demonstrations to perform fast Bayesian inference. We evaluate our proposed approach on the task of learning to play Atari games from demonstrations, without access to the game score. For Atari games our approach enables us to generate 100,000 samples from the posterior over reward functions in only 5 minutes using a personal laptop. Furthermore, our proposed approach achieves comparable or better imitation learning performance than state-of-the-art methods that only find a point estimate of the reward function. Finally, we show that our approach enables efficient high-confidence policy performance bounds. We show that these high-confidence performance bounds can be used to rank the performance and risk of a variety of evaluation policies, despite not having samples of the reward function. We also show evidence that high-confidence performance bounds can be used to detect reward hacking in complex imitation learning problems.
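The idea of sampling reward functions from pairwise preferences over a fixed low-dimensional feature encoding can be sketched as follows. The abstract does not state the likelihood or the sampler, so this sketch assumes a Bradley-Terry preference model, a linear reward over trajectory features, a standard normal prior, and random-walk Metropolis sampling. All names, features, and hyperparameters are hypothetical.

```python
import math
import random


def log_sigmoid(x):
    """Numerically stable log(1 / (1 + exp(-x)))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def log_posterior(w, features, preferences, beta=5.0):
    """Unnormalized log posterior of linear reward weights w.

    features[i] is the (precomputed) feature vector of trajectory i and
    preferences is a list of (worse, better) index pairs. Assumes a
    Bradley-Terry likelihood and a standard normal prior on w.
    """
    returns = [sum(wi * fi for wi, fi in zip(w, phi)) for phi in features]
    log_lik = sum(
        log_sigmoid(beta * (returns[better] - returns[worse]))
        for worse, better in preferences
    )
    log_prior = -0.5 * sum(wi * wi for wi in w)
    return log_lik + log_prior


def sample_rewards(features, preferences, n_samples=2000, step=0.3, seed=0):
    """Random-walk Metropolis over reward weights; returns all visited samples."""
    rng = random.Random(seed)
    w = [0.0] * len(features[0])
    current = log_posterior(w, features, preferences)
    samples = []
    for _ in range(n_samples):
        proposal = [wi + rng.gauss(0.0, step) for wi in w]
        candidate = log_posterior(proposal, features, preferences)
        # 1 - random() lies in (0, 1], so the logarithm is always defined.
        if math.log(1.0 - rng.random()) < candidate - current:
            w, current = proposal, candidate
        samples.append(list(w))
    return samples


# Hypothetical data: three trajectories, ranked t0 < t1 < t2.
features = [[0.0, 1.0], [1.0, 0.0], [2.0, 0.0]]
preferences = [(0, 1), (1, 2), (0, 2)]
samples = sample_rewards(features, preferences)
mean_w0 = sum(s[0] for s in samples) / len(samples)
```

Because the features are fixed in advance, each posterior evaluation costs only a few dot products, which is what makes drawing many samples cheap in a setup of this kind.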
GenDICE: Generalized Offline Estimation of Stationary Values
Zhang, Ruiyi, Dai, Bo, Li, Lihong, Schuurmans, Dale
An important problem that arises in reinforcement learning and Monte Carlo methods is estimating quantities defined by the stationary distribution of a Markov chain. In many real-world applications, access to the underlying transition operator is limited to a fixed set of data that has already been collected, without additional interaction with the environment being available. We show that consistent estimation remains possible in this challenging scenario, and that effective estimation can still be achieved in important applications. Our approach is based on estimating a ratio that corrects for the discrepancy between the stationary and empirical distributions, derived from fundamental properties of the stationary distribution, and exploiting constraint reformulations based on variational divergence minimization. The resulting algorithm, GenDICE, is straightforward and effective. We prove its consistency under general conditions, provide an error analysis, and demonstrate strong empirical performance on benchmark problems, including off-line PageRank and off-policy policy evaluation.
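The role of the correction ratio can be illustrated on a toy Markov chain. Unlike GenDICE, which estimates the ratio from logged data alone, this sketch assumes the transition matrix is known and computes the stationary distribution by power iteration, purely to show what the ratio does. The chain, data, and function names are hypothetical.

```python
def stationary_distribution(P, iterations=500):
    """Power iteration d <- d P for a row-stochastic transition matrix P."""
    n = len(P)
    d = [1.0 / n] * n
    for _ in range(iterations):
        d = [sum(d[i] * P[i][j] for i in range(n)) for j in range(n)]
    return d


def ratio_corrected_mean(samples, f, P):
    """Estimate the stationary expectation of f from off-distribution samples.

    Each sample s is reweighted by tau(s) = stationary(s) / empirical(s),
    the ratio that corrects for the mismatch between the two distributions.
    """
    n = len(P)
    counts = [0] * n
    for s in samples:
        counts[s] += 1
    empirical = [c / len(samples) for c in counts]
    stationary = stationary_distribution(P)
    tau = [
        stationary[s] / empirical[s] if empirical[s] > 0 else 0.0
        for s in range(n)
    ]
    return sum(tau[s] * f[s] for s in samples) / len(samples)


# Hypothetical two-state chain whose stationary distribution is (5/6, 1/6),
# with logged data that visits both states equally often.
P = [[0.9, 0.1], [0.5, 0.5]]
logged_states = [0, 1, 0, 1]
f = [0.0, 1.0]
naive = sum(f[s] for s in logged_states) / len(logged_states)
corrected = ratio_corrected_mean(logged_states, f, P)
```

The naive average over the logged data gives 0.5, while the reweighted average recovers the stationary value of 1/6.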