Reinforcement Learning
Variational Reward Estimator Bottleneck: Learning Robust Reward Estimator for Multi-Domain Task-Oriented Dialog
Park, Jeiyoon, Lee, Chanhee, Kim, Kuekyeng, Lim, Heuiseok
Despite the notable success of adversarial learning approaches to multi-domain task-oriented dialog systems, training the dialog policy via adversarial inverse reinforcement learning often fails to balance the performance of the policy generator and the reward estimator. During optimization, the reward estimator often overwhelms the policy generator and produces excessively uninformative gradients. We propose the Variational Reward estimator Bottleneck (VRB), an effective regularization method that aims to constrain unproductive information flows between inputs and the reward estimator. The VRB focuses on capturing discriminative features by exploiting an information bottleneck on mutual information. Empirical results on a multi-domain task-oriented dialog dataset demonstrate that the VRB significantly outperforms previous methods.
MM-KTD: Multiple Model Kalman Temporal Differences for Reinforcement Learning
Malekzadeh, Parvin, Salimibeni, Mohammad, Mohammadi, Arash, Assa, Akbar, Plataniotis, Konstantinos N.
There has been a surge of interest in the development of advanced Reinforcement Learning (RL) systems as intelligent approaches to learn optimal control policies directly from smart agents' interactions with the environment. Objectives: In a model-free RL method with a continuous state space, the value function of the states typically needs to be approximated. In this regard, Deep Neural Networks (DNNs) provide an attractive modeling mechanism to approximate the value function using sample transitions. DNN-based solutions, however, suffer from high sensitivity to parameter selection, are prone to overfitting, and are not very sample efficient. A Kalman-based methodology, on the other hand, could be used as an efficient alternative. Such an approach, however, commonly requires a priori information about the system (such as noise statistics) to perform efficiently. The main objective of this paper is to address this issue. Methods: As a remedy to the aforementioned problems, this paper proposes an innovative Multiple Model Kalman Temporal Difference (MM-KTD) framework, which adapts the parameters of the filter using the observed states and rewards. Moreover, an active learning method is proposed to enhance the sampling efficiency of the system. More specifically, the estimated uncertainty of the value functions is exploited to form the behaviour policy, leading to more visits to less certain values and therefore improving the overall sample efficiency of learning. As a result, the proposed MM-KTD framework can learn the optimal policy with a significantly reduced number of samples compared to its DNN-based counterparts. Results: To evaluate the performance of the proposed MM-KTD framework, we have performed a comprehensive set of experiments based on three RL benchmarks. Experimental results show the superiority of the MM-KTD framework over its state-of-the-art counterparts.
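To make the Kalman-based idea concrete, here is a minimal sketch of a single-model Kalman temporal difference update for a linear value function. It is not the MM-KTD algorithm itself: it omits the multiple-model adaptation of the filter parameters and the active-learning behaviour policy, and it assumes fixed, hand-picked noise settings. The function name `ktd_update`, the one-hot features, the two-state chain, and the values of `q` and `r_var` are all illustrative assumptions.

```python
def ktd_update(theta, P, phi_s, phi_next, reward, gamma=0.9, q=1e-4, r_var=1e-2):
    """One Kalman TD step for a linear value function V(s) = theta . phi(s).

    q (process noise) and r_var (observation noise) are assumed constants here;
    adapting such quantities online is what a multiple-model scheme would add.
    """
    n = len(theta)
    # Observation model: reward ~ (phi(s) - gamma * phi(s')) . theta + noise
    h = [phi_s[i] - gamma * phi_next[i] for i in range(n)]
    # Prediction: a random-walk model on theta inflates the covariance
    P_pred = [[P[i][j] + (q if i == j else 0.0) for j in range(n)] for i in range(n)]
    # Innovation (the TD error) and its variance
    td_error = reward - sum(h[i] * theta[i] for i in range(n))
    Ph = [sum(P_pred[i][j] * h[j] for j in range(n)) for i in range(n)]
    s_var = sum(h[i] * Ph[i] for i in range(n)) + r_var
    # Kalman gain, then correction of the mean and covariance
    gain = [Ph[i] / s_var for i in range(n)]
    theta_new = [theta[i] + gain[i] * td_error for i in range(n)]
    P_new = [[P_pred[i][j] - gain[i] * Ph[j] for j in range(n)] for i in range(n)]
    return theta_new, P_new


# Hypothetical two-state chain: s0 -> s1 (reward 0), s1 -> terminal (reward 1).
# With gamma = 0.9 the true values are V(s1) = 1.0 and V(s0) = 0.9.
theta = [0.0, 0.0]
P = [[10.0, 0.0], [0.0, 10.0]]
transitions = [
    ([1.0, 0.0], [0.0, 1.0], 0.0),  # s0 -> s1
    ([0.0, 1.0], [0.0, 0.0], 1.0),  # s1 -> terminal (zero features)
]
for _ in range(50):
    for phi_s, phi_next, reward in transitions:
        theta, P = ktd_update(theta, P, phi_s, phi_next, reward)
```

The diagonal of `P` gives a per-parameter uncertainty estimate, which is the kind of quantity the abstract describes using to steer exploration toward less certain values.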
AI-based Resource Allocation: Reinforcement Learning for Adaptive Auto-scaling in Serverless Environments
Schuler, Lucia, Jamil, Somaya, Kühl, Niklas
Serverless computing has emerged as a compelling new paradigm of cloud computing models in recent years. It promises the user services at large scale and low cost while eliminating the need for infrastructure management. On the cloud provider side, flexible resource management is required to meet fluctuating demand. It can be enabled through automated provisioning and deprovisioning of resources. A common approach among both commercial and open-source serverless computing platforms is workload-based auto-scaling, where a designated algorithm scales instances according to the number of incoming requests. In the recently evolving serverless framework Knative, a request-based policy is proposed, where the algorithm scales resources by a configured maximum number of requests that can be processed in parallel per instance, the so-called concurrency. As we show in a baseline experiment, this predefined concurrency level can strongly influence the performance of a serverless application. However, identifying the concurrency configuration that yields the highest possible quality of service is a challenging task because various factors, e.g. varying workload and complex infrastructure characteristics, influence throughput and latency. While there has been considerable research into intelligent techniques for optimizing auto-scaling for virtual machine provisioning, this topic has not yet been discussed in the area of serverless computing. For this reason, we investigate the applicability of a reinforcement learning approach, which has been proven on dynamic virtual machine provisioning, to request-based auto-scaling in a serverless framework. Our results show that within a limited number of iterations our proposed model learns an effective scaling policy per workload, improving the performance compared to the default auto-scaling configuration.
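The request-based policy described above can be illustrated with a simplified sketch: the number of instances is the number of in-flight requests divided by the configured concurrency, rounded up. This is not Knative's actual autoscaler code, which involves further mechanisms not covered here; the function name `desired_instances` and the `min_instances` and `max_instances` bounds are assumptions made for illustration.

```python
import math


def desired_instances(in_flight_requests, target_concurrency,
                      min_instances=0, max_instances=100):
    """Simplified concurrency-based scaling rule (illustrative only).

    target_concurrency is the configured maximum number of requests one
    instance processes in parallel; the bounds are hypothetical defaults.
    """
    if target_concurrency <= 0:
        raise ValueError("target_concurrency must be positive")
    # Enough instances so that no instance exceeds the concurrency target
    raw = math.ceil(in_flight_requests / target_concurrency)
    # Clamp to the allowed range
    return max(min_instances, min(max_instances, raw))


low_concurrency = desired_instances(250, target_concurrency=10)
high_concurrency = desired_instances(250, target_concurrency=100)
```

The same load of 250 concurrent requests yields very different instance counts depending on the concurrency setting, which is why the abstract treats this single parameter as the quantity a learned policy should tune per workload.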
Local and Global Explanations of Agent Behavior: Integrating Strategy Summaries with Saliency Maps
Huber, Tobias, Weitz, Katharina, André, Elisabeth, Amir, Ofra
With advances in reinforcement learning (RL), agents are now being developed in high-stakes application domains such as healthcare and transportation. Explaining the behavior of these agents is challenging, as the environments in which they act have large state spaces, and their decision-making can be affected by delayed rewards, making it difficult to analyze their behavior. To address this problem, several approaches have been developed. Some approaches attempt to convey the global behavior of the agent, describing the actions it takes in different states. Other approaches devise local explanations, which provide information regarding the agent's decision-making in a particular state. In this paper, we combine global and local explanation methods, and evaluate their joint and separate contributions, providing (to the best of our knowledge) the first user study of combined local and global explanations for RL agents. Specifically, we augment strategy summaries, which extract important trajectories of states from simulations of the agent, with saliency maps, which show what information the agent attends to. Our results show that the choice of what states to include in the summary (global information) strongly affects people's understanding of agents: participants shown summaries that included important states significantly outperformed participants who were presented with agent behavior in a set of randomly chosen world-states. We find mixed results with respect to augmenting demonstrations with saliency maps (local information), as the addition of saliency maps did not significantly improve performance in most cases. However, we do find some evidence that saliency maps can help users better understand what information the agent relies on in its decision-making, suggesting avenues for future work that can further improve explanations of RL agents.
AI Research Considerations for Human Existential Safety (ARCHES)
Critch, Andrew, Krueger, David
Framed in positive terms, this report examines how technical AI research might be steered in a manner that is more attentive to humanity's long-term prospects for survival as a species. In negative terms, we ask what existential risks humanity might face from AI development in the next century, and by what principles contemporary technical research might be directed to address those risks. A key property of hypothetical AI technologies is introduced, called prepotence, which is useful for delineating a variety of potential existential risks from artificial intelligence, even as AI paradigms might shift. A set of contemporary research directions is then examined for their potential benefit to existential safety. Each research direction is explained with a scenario-driven motivation, and examples of existing work from which to build. The research directions present their own risks and benefits to society that could occur at various scales of impact, and in particular are not guaranteed to benefit existential safety if major developments in them are deployed without adequate forethought and oversight. As such, each direction is accompanied by a consideration of potentially negative side effects.
Robots learning to move like animals
Whether it's a dog chasing after a ball, or a monkey swinging through the trees, animals can effortlessly perform an incredibly rich repertoire of agile locomotion skills. But designing controllers that enable legged robots to replicate these agile behaviors can be a very challenging task. The superior agility seen in animals, as compared to robots, might lead one to wonder: can we create more agile robotic controllers with less effort by directly imitating animals? In this work, we present a framework for learning robotic locomotion skills by imitating animals. Given a reference motion clip recorded from an animal (e.g. a dog), our framework uses reinforcement learning to train a control policy that enables a robot to imitate the motion in the real world.
Curriculum for Reinforcement Learning
A curriculum is an efficient tool for humans to progressively learn from simple concepts to hard problems. It breaks down complex knowledge by providing a sequence of learning steps of increasing difficulty. In this post, we will examine how the idea of curriculum can help reinforcement learning models learn to solve complicated tasks. It sounds like an impossible task if we want to teach integrals or derivatives to a 3-year-old who does not even know basic arithmetic. That's why education is important, as it provides a systematic way to break down complex knowledge and a nice curriculum for teaching concepts from simple to hard. A curriculum makes learning difficult things easier and more approachable for us humans.
Intelligent Residential Energy Management System using Deep Reinforcement Learning
Mathew, Alwyn, Roy, Abhijit, Mathew, Jimson
The rising demand for electricity and its essential nature in today's world calls for intelligent home energy management (HEM) systems that can reduce energy usage. This involves shifting loads from peak hours of the day, when energy consumption is at its highest, to leaner off-peak periods, when energy consumption is relatively lower, thereby reducing the system's peak load demand, which in turn results in lower energy bills and an improved load demand profile. This work introduces a novel way to develop a learning system that can learn from experience to shift loads from one time instance to another and achieve the goal of minimizing the aggregate peak load. This paper proposes a Deep Reinforcement Learning (DRL) model for demand response where the virtual agent learns the task like humans do. The agent gets feedback for every action it takes in the environment; this feedback drives the agent to learn about the environment and take much smarter steps later in its learning stages. Our method outperformed the state-of-the-art mixed integer linear programming (MILP) approach for load peak reduction. The authors have also designed an agent to learn to minimize both consumers' electricity bills and utilities' system peak load demand simultaneously. The proposed model was analyzed with loads from five different residential consumers; the proposed method increases the monthly savings of each consumer by reducing their electricity bill drastically, along with minimizing the peak load on the system, when time-shiftable loads are handled by the proposed method.
Domain Knowledge Integration By Gradient Matching For Sample-Efficient Reinforcement Learning
Model-free deep reinforcement learning (RL) agents can learn an effective policy directly from repeated interactions with a black-box environment. However, in practice, the algorithms often require large amounts of training experience to learn and generalize well. In addition, classic model-free learning ignores the domain information contained in the state transition tuples. Model-based RL, on the other hand, attempts to learn a model of the environment from experience and is substantially more sample efficient, but suffers from a large asymptotic bias owing to the imperfect dynamics model. In this paper, we propose a gradient matching algorithm to improve sample efficiency by utilizing target slope information from the dynamics predictor to aid the model-free learner. We demonstrate this by presenting a technique for matching the gradient information from the model-based learner with the model-free component in an abstract low-dimensional space, and we validate the proposed technique through experimental results that demonstrate the efficacy of this approach.
Deep Reinforcement learning for real autonomous mobile robot navigation in indoor environments
Surmann, Hartmut, Jestel, Christian, Marchel, Robin, Musberg, Franziska, Elhadj, Houssem, Ardani, Mahbube
Deep Reinforcement Learning has been successfully applied in various computer games [8]. However, it is still rarely used in real-world applications, especially for the navigation and continuous control of real mobile robots [13]. Previous approaches lack safety and robustness and/or need a structured environment. In this paper we present our proof of concept for autonomous self-learning robot navigation in an unknown environment for a real robot without a map or planner. The input for the robot is only the fused data from a 2D laser scanner and an RGB-D camera as well as the orientation to the goal. The map of the environment is unknown. The output actions of an Asynchronous Advantage Actor-Critic network (GA3C) are the linear and angular velocities for the robot. The navigator/controller network is pretrained in a high-speed, parallel, and self-implemented simulation environment to speed up the learning process and then deployed to the real robot. To avoid overfitting, we train relatively small networks, and we add random Gaussian noise to the input laser data. The sensor data fusion with the RGB-D camera allows the robot to navigate in real environments with real 3D obstacle avoidance and without the need to fit the environment to the sensory capabilities of the robot. To further increase the robustness, we train on environments of varying difficulties and run 32 training instances simultaneously. Video: supplementary file / YouTube. Code: GitHub.
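The noise augmentation mentioned above (adding random Gaussian noise to the input laser data) might look roughly like the following sketch. The abstract does not give the noise level or sensor range, so `sigma`, `max_range`, and the function name `add_laser_noise` are assumptions chosen only for illustration.

```python
import random


def add_laser_noise(ranges, sigma=0.02, max_range=10.0, rng=None):
    """Return a noisy copy of a 2D laser scan (distances in metres).

    sigma and max_range are hypothetical values, not taken from the paper.
    Each reading gets independent zero-mean Gaussian noise and is then
    clipped to the physically valid interval [0, max_range].
    """
    rng = rng or random.Random()
    return [min(max_range, max(0.0, r + rng.gauss(0.0, sigma))) for r in ranges]


# Example scan with a fixed seed so the augmentation is reproducible
scan = [0.5, 1.2, 3.4, 9.99, 0.01]
noisy_scan = add_laser_noise(scan, rng=random.Random(0))
```

Applying fresh noise to every training observation keeps the policy from latching onto exact simulated readings, which is the overfitting concern the abstract raises.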