 Reinforcement Learning


Jointly Pre-training with Supervised, Autoencoder, and Value Losses for Deep Reinforcement Learning

arXiv.org Machine Learning

Deep Reinforcement Learning (DRL) algorithms are known to be data inefficient. One reason is that a DRL agent learns both its features and its policy tabula rasa. Integrating prior knowledge into DRL algorithms is one way to improve learning efficiency, since it helps to build useful representations. In this work, we consider incorporating human knowledge to accelerate the asynchronous advantage actor-critic (A3C) algorithm by pre-training on a small amount of non-expert human demonstrations. We leverage the supervised autoencoder framework and propose a novel pre-training strategy that jointly trains a weighted supervised classification loss, an unsupervised reconstruction loss, and an expected return loss. The resulting pre-trained model learns more useful features than models trained independently in a supervised or unsupervised fashion. Our pre-training method drastically improved the learning performance of the A3C agent in the Atari games Pong and MsPacman, exceeding the performance of state-of-the-art algorithms with far fewer game interactions. Our method is lightweight and easy to implement on a single machine. For reproducibility, our code is available at github.com/gabrieledcjr/DeepRL/tree/A3C-ALA2019
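
The joint objective described in this abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the per-sample formulation, and the scalar weights `w_cls`, `w_ae`, and `w_val` are all assumptions made for illustration, and the paper's own weighting scheme may differ. The sketch only shows the general shape of summing a classification term, a reconstruction term, and a return-regression term.

```python
import math

def cross_entropy(logits, action):
    """Negative log-probability of the demonstrated action under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[action]

def mse(pred, target):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def joint_loss(logits, action, recon, frame, value, ret,
               w_cls=1.0, w_ae=1.0, w_val=1.0):
    """Hypothetical joint pre-training loss for one demonstration sample.

    logits/action: policy-head outputs and the human's action (supervised term)
    recon/frame:   decoder output and the input observation (autoencoder term)
    value/ret:     value-head output and the observed return (value term)
    """
    cls = cross_entropy(logits, action)
    ae = mse(recon, frame)
    val = (value - ret) ** 2
    return w_cls * cls + w_ae * ae + w_val * val
```

With uniform logits over four actions and perfect reconstruction and value predictions, only the classification term remains and the loss equals log 4.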


Deep Reinforcement Learning on a Budget: 3D Control and Reasoning Without a Supercomputer

arXiv.org Machine Learning

An important goal of research in Deep Reinforcement Learning in mobile robotics is to train agents capable of solving complex tasks, which require a high level of scene understanding and reasoning from an egocentric perspective. When trained from simulations, optimal environments should satisfy a currently unobtainable combination of high-fidelity photographic observations, massive amounts of different environment configurations and fast simulation speeds. In this paper we argue that research on training agents capable of complex reasoning can be simplified by decoupling from the requirement of high-fidelity photographic observations. We present a suite of tasks requiring complex reasoning and exploration in continuous, partially observable 3D environments. The objective is to provide challenging scenarios and a robust baseline agent architecture that can be trained on mid-range consumer hardware in under 24 hours. Our scenarios combine two key advantages: (i) they are based on a simple but highly efficient 3D environment (ViZ-Doom) which allows high-speed simulation (12,000 fps); (ii) the scenarios provide the user with a range of difficulty settings, in order to identify the limitations of current state-of-the-art algorithms and network architectures. We aim to increase accessibility to the field of Deep-RL by providing baselines for challenging scenarios where new ideas can be iterated on quickly. We argue that the community should be able to address challenging problems in reasoning of mobile agents without the need for a large compute infrastructure.


Planning with Expectation Models

arXiv.org Artificial Intelligence

Distribution and sample models are two popular model choices in model-based reinforcement learning (MBRL). However, learning these models can be intractable, particularly when the state and action spaces are large. Expectation models, on the other hand, are relatively easier to learn due to their compactness and have also been widely used for deterministic environments. For stochastic environments, it is not obvious how expectation models can be used for planning as they only partially characterize a distribution. In this paper, we propose a sound way of using approximate expectation models for MBRL. In particular, we 1) show that planning with an expectation model is equivalent to planning with a distribution model if the state value function is linear in state features, 2) analyze two common parametrization choices for approximating the expectation: linear and non-linear expectation models, 3) propose a sound model-based policy evaluation algorithm and present its convergence results, and 4) empirically demonstrate the effectiveness of the proposed planning algorithm.
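
Point 1) of this abstract rests on linearity of expectation, which a short derivation makes explicit. The notation is an assumption introduced here, not taken from the paper: $x(s)$ is a feature vector, $w$ a weight vector, and $\hat{x}(s,a)$ the expectation model's predicted next feature vector.

```latex
% Assume the value function is linear in the features: v(s) = w^\top x(s).
% Assume the expectation model outputs \hat{x}(s,a) = \mathbb{E}[x(S') \mid s, a].
\begin{aligned}
\mathbb{E}\bigl[v(S') \mid s, a\bigr]
  &= \mathbb{E}\bigl[w^\top x(S') \mid s, a\bigr] \\
  &= w^\top \mathbb{E}\bigl[x(S') \mid s, a\bigr] \\
  &= w^\top \hat{x}(s, a).
\end{aligned}
```

Under these assumptions, the expected next-state value computed from the full next-state distribution equals the value of the expected next feature vector, so a backup only needs the expectation. If $v$ is non-linear in the features, the second equality no longer holds in general.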


Automatic Left Atrial Appendage Orifice Detection for Preprocedural Planning of Appendage Closure

arXiv.org Artificial Intelligence

In preoperative planning of left atrial appendage closure (LAAC) with CT angiography, the assessment of the appendage orifice plays a crucial role in choosing an appropriate LAAC device size and a proper C-arm angulation. However, accurate orifice detection is laborious because of the high anatomic variation of the appendage, as well as the unclear orifice position and orientation in the available views. We propose an automatic orifice detection approach that performs a search on the principal medial axis of the appendage, and we present an efficient iterative algorithm to grow the axis from the appendage to the left atrium. We propose to use the axis-to-surface distance of the appendage for efficient and effective detection. To localize the necessary initial seed for growing the medial axis, we train an artificial localization agent using an actor-critic reinforcement learning approach, defining the localization as a sequential decision process. The entire detection process takes only about 8 seconds, and the variance of the detected orifice with respect to annotations from two experts is small and less than the inter-observer variance. The proposed orifice search on the medial axis of the appendage, which compares only the axis's distance from the surface, provides a simple yet robust solution for orifice detection. While being the first fully automatic approach and providing a detection error below the inter-observer difference, our method improved the detection efficiency by eighteen times compared to the existing solution, and can therefore be potentially useful for physicians.


Efficient and Safe Exploration in Deterministic Markov Decision Processes with Unknown Transition Models

arXiv.org Artificial Intelligence

I. INTRODUCTION Guaranteeing safety is a vital issue for many modern robotics systems, such as unmanned aerial vehicles (UAVs), autonomous cars, or domestic robots [1], [2], [3]. One approach is to attempt to specify all potential scenarios [...] Process (MDP) using Gaussian processes. In their work, they assumed the transition model is known and that there exists a predefined safety function. Both of these assumptions can be quite restrictive when the system is going to operate in unknown environments. In our work, we plan to address both of these challenges by considering unknown transition models, and no access to a predefined safety function.


Guided Meta-Policy Search

arXiv.org Artificial Intelligence

Reinforcement learning (RL) algorithms have demonstrated promising results on complex tasks, yet often require impractical numbers of samples because they learn from scratch. Meta-RL aims to address this challenge by leveraging experience from previous tasks in order to more quickly solve new tasks. However, in practice, these algorithms generally also require large amounts of on-policy experience during the meta-training process, making them impractical for use in many problems. To this end, we propose to learn a reinforcement learning procedure through imitation of expert policies that solve previously-seen tasks. This involves a nested optimization, with RL in the inner loop and supervised imitation learning in the outer loop. Because the outer loop imitation learning can be done with off-policy data, we can achieve significant gains in meta-learning sample efficiency. In this paper, we show how this general idea can be used both for meta-reinforcement learning and for learning fast RL procedures from multi-task demonstration data. The former results in an approach that can leverage policies learned for previous tasks without significant amounts of on-policy data during meta-training, whereas the latter is particularly useful in cases where demonstrations are easy for a person to provide. Across a number of continuous control meta-RL problems, we demonstrate significant improvements in meta-RL sample efficiency in comparison to prior work as well as the ability to scale to domains with visual observations.


Generative predecessor models for sample-efficient imitation learning

arXiv.org Machine Learning

We propose Generative Predecessor Models for Imitation Learning (GPRIL), a novel imitation learning algorithm that matches the state-action distribution to the distribution observed in expert demonstrations, using generative models to reason probabilistically about alternative histories of demonstrated states. We show that this approach allows an agent to learn robust policies using only a small number of expert demonstrations and self-supervised interactions with the environment. We derive this approach from first principles and compare it empirically to a state-of-the-art imitation learning method, showing that it matches or outperforms that method on two simulated robot manipulation tasks. We also demonstrate significantly higher sample efficiency by applying the algorithm on a real robot.


Learning Good Representation via Continuous Attention

arXiv.org Machine Learning

In this paper we present our finding that good representations can be learned via continuous attention during the interaction between Unsupervised Learning (UL) and Reinforcement Learning (RL) modules driven by intrinsic motivation. Specifically, we designed intrinsic rewards generated from UL modules that drive the RL agent to focus on objects for a period of time and to learn good representations of those objects for a later object recognition task. We evaluate our proposed algorithm in settings both with and without extrinsic reward. Experiments with end-to-end training in simulated environments, with applications to few-shot object recognition, demonstrated the effectiveness of the proposed algorithm.


Distributed Power Control for Large Energy Harvesting Networks: A Multi-Agent Deep Reinforcement Learning Approach

arXiv.org Artificial Intelligence

In this paper, we develop a multi-agent reinforcement learning (MARL) framework to obtain online power control policies for a large energy harvesting (EH) multiple access channel, when only causal information about the EH process and wireless channel is available. In the proposed framework, we model the online power control problem as a discrete-time mean-field game (MFG), and leverage deep reinforcement learning to learn the stationary solution of the game in a distributed fashion. We analytically show that the proposed procedure converges to the unique stationary solution of the MFG. Using the proposed framework, the power control policies are learned in a completely distributed fashion. In order to benchmark the performance of the distributed policies, we also develop deep neural network (DNN) based centralized and distributed online power control schemes. Our simulation results show the efficacy of the proposed power control policies. In particular, the DNN based centralized power control policies provide very good performance for large EH networks, for which the design of optimal policies is intractable using conventional methods such as Markov decision processes. Further, the performance of both distributed policies is close to the throughput achieved by the centralized policies. The work in this paper will appear in part at IEEE ICASSP 2019 [1] and IEEE WiOpt 2019 [2]. This research has been partly supported by the ERC-PoC 727682 CacheMire project. I. INTRODUCTION Internet-of-things (IoT) [3] networks connect a large number of low power sensors whose lifespan is typically limited by the energy that can be stored in their batteries. In this context, the advent of energy harvesting (EH) technology [4] promises to prolong the lifespan of IoT networks by enabling the nodes to operate by harvesting energy from environmental sources, e.g., the sun, the wind, etc.


Lane Change Decision-making through Deep Reinforcement Learning with Rule-based Constraints

arXiv.org Artificial Intelligence

Autonomous driving decision-making is a great challenge due to the complexity and uncertainty of the traffic environment. In this study, a Deep Q-Network (DQN) based method combined with rule-based constraints is applied to the autonomous driving lane change decision-making task. Through the combination of high-level lateral decision-making and low-level rule-based trajectory modification, safe and efficient lane change behavior can be achieved. With our state representation and reward function, the trained agent is able to take appropriate actions in a real-world-like simulator. The generated policy is evaluated in the simulator 10 times, and the results demonstrate that the proposed rule-based DQN method outperforms both the rule-based approach and the DQN method.
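
One common way to combine learned Q-values with hand-written rules is action masking: the rules remove unsafe actions and the agent picks the best remaining one. The sketch below illustrates that general idea only. It is not the method of this paper, which describes rule-based trajectory modification at a lower level; the action names, the gap-based rule, and the `min_gap` threshold are all hypothetical.

```python
ACTIONS = ("keep_lane", "change_left", "change_right")

def allowed_actions(gap_left, gap_right, min_gap=10.0):
    """Return the actions a simple, hypothetical safety rule permits.

    gap_left/gap_right: free distance in metres in the adjacent lane,
    or None if that lane does not exist. Keeping the lane is always allowed.
    """
    allowed = ["keep_lane"]
    if gap_left is not None and gap_left >= min_gap:
        allowed.append("change_left")
    if gap_right is not None and gap_right >= min_gap:
        allowed.append("change_right")
    return allowed

def select_action(q_values, gap_left, gap_right, min_gap=10.0):
    """Pick the highest-valued action among those the rule allows.

    q_values: dict mapping each name in ACTIONS to its estimated Q-value,
    as produced by some trained network (not shown here).
    """
    candidates = allowed_actions(gap_left, gap_right, min_gap)
    return max(candidates, key=lambda a: q_values[a])
```

For example, if the network prefers `change_left` but the left gap is below the threshold, the selection falls back to the best action that the rule still permits.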