Reinforcement Learning
[D] What are you currently 'stuck' on right now / these days? • r/MachineLearning
Currently I'm searching for a Reinforcement Learning toolkit for autonomous driving to test the influence of several safety aspects during learning as a reward function. So far I have tested OpenAI Gym with the "Neon racer" environment, which does not provide that information. Are there any other toolkits you would suggest for this purpose?
Object Manipulation Learning by Imitation
We aim to enable a robot to learn object manipulation by imitation. Given external observations of demonstrations of object manipulation, we believe that the two underlying problems to address in learning by imitation are 1) segmenting a given demonstration into skills that can be individually learned and reused, and 2) formulating the correct RL (Reinforcement Learning) problem that considers only the relevant aspects of each skill, so that the policy for each skill can be learned effectively. Previous work has made some progress in this direction, but none has taken private information into account. Public information is the information available in the external observations of a demonstration, and private information is the information available only to the agent that executes the actions, such as tactile sensations. Our contribution is a method by which the robot automatically segments a demonstration of object manipulation into multiple skills, formulates the correct RL problem for each skill, and decides, based on interaction with the world, whether private information is an important aspect of each skill. Our experiment shows that our robot learns to pick up a block and stack it onto another block by imitating an observed demonstration. The evaluation is based on 1) whether the demonstration is reasonably segmented, 2) whether the correct RL problems are formulated, and 3) whether a good policy is learned.
Lecture 14 Deep Reinforcement Learning
In Lecture 14 we move from supervised learning to reinforcement learning (RL), in which an agent must learn to interact with an environment in order to maximize its reward. We discuss different algorithms for reinforcement learning including Q-Learning, policy gradients, and Actor-Critic. We show how deep reinforcement learning has been used to play Atari games and to achieve super-human Go performance in AlphaGo. Core to many of these applications are visual recognition tasks such as image classification, localization and detection. Recent developments in neural network (aka "deep learning") approaches have greatly advanced the performance of these state-of-the-art visual recognition systems.
Variational Adaptive-Newton Method for Explorative Learning
Khan, Mohammad Emtiyaz, Lin, Wu, Tangkaratt, Voot, Liu, Zuozhu, Nielsen, Didrik
We present the Variational Adaptive Newton (VAN) method, a black-box optimization method especially suitable for explorative-learning tasks such as active learning and reinforcement learning. Similar to Bayesian methods, VAN estimates a distribution that can be used for exploration, but requires computations that are similar to continuous optimization methods. Our theoretical contribution reveals that VAN is a second-order method that unifies existing methods in distinct fields of continuous optimization, variational inference, and evolution strategies. Our experimental results show that VAN performs well on a wide variety of learning tasks. This work presents a general-purpose explorative-learning method that has the potential to improve learning in areas such as active learning and reinforcement learning.
Wald-Kernel: Learning to Aggregate Information for Sequential Inference
Sequential hypothesis testing is a desirable decision-making strategy in any time-sensitive scenario. Compared with fixed sample-size testing, sequential testing can meet identical probability-of-error requirements using fewer samples on average. For a binary detection problem, it is well known that, when the density functions are known, accumulating the likelihood ratio statistics is time-optimal under a fixed error rate constraint. This paper considers the problem of learning a binary sequential detector from training samples when the density functions are unavailable. We formulate the problem as a constrained likelihood ratio estimation, which can be solved efficiently through convex optimization by imposing Reproducing Kernel Hilbert Space (RKHS) structure on the log-likelihood ratio function. In addition, we provide a computationally efficient approximate solution for large-scale data sets. The proposed algorithm, namely Wald-Kernel, is tested on a synthetic data set and two real-world data sets, together with previous approaches for likelihood ratio estimation. Our empirical results show that the classifier trained through the proposed technique achieves a smaller average sampling cost than previous approaches proposed in the literature for the same error rate.
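The known-density case mentioned in the abstract can be illustrated with the classical sequential probability ratio test. The sketch below is not the Wald-Kernel algorithm: it assumes the log-likelihood ratio is given in closed form (two unit-variance Gaussians, an illustrative choice) and uses Wald's approximate thresholds. The names `sprt` and `gaussian_log_lr` and the default error rates are assumptions made for this example.

```python
import math


def sprt(samples, log_lr, alpha=0.05, beta=0.05):
    """Accumulate log-likelihood ratios until a threshold is crossed.

    Returns (decision, n_used): decision is 1 (accept H1), 0 (accept H0),
    or None if the samples run out before either threshold is reached.
    alpha and beta are the target false-alarm and miss probabilities.
    """
    upper = math.log((1 - beta) / alpha)  # Wald's approximate H1 threshold
    lower = math.log(beta / (1 - alpha))  # Wald's approximate H0 threshold
    total = 0.0
    n = 0
    for n, x in enumerate(samples, start=1):
        total += log_lr(x)
        if total >= upper:
            return 1, n
        if total <= lower:
            return 0, n
    return None, n


def gaussian_log_lr(x, mu0=0.0, mu1=1.0, sigma=1.0):
    """log p1(x) / p0(x) for two Gaussians with the same variance (assumed densities)."""
    return ((x - mu0) ** 2 - (x - mu1) ** 2) / (2 * sigma ** 2)
```

When the densities are unknown, as in the setting the paper studies, `log_lr` would have to be replaced by a function estimated from training samples; the stopping rule itself stays the same.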
A unified decision making framework for supply and demand management in microgrid networks
Diddigi, Raghuram Bharadwaj, Danda, Sai Koti Reddy, Narayanam, Krishnasuri, Bhatnagar, Shalabh
This paper considers two important problems, one on the supply side and one on the demand side, and studies both in a unified framework. On the supply side, we study the problem of energy sharing among microgrids with the goal of maximizing the profit obtained from selling power while meeting customer demand. Under a shortage of power, on the other hand, this problem becomes one of deciding the amount of power to be bought at dynamically varying prices. On the demand side, we consider the problem of optimally scheduling the time-adjustable demand, i.e., loads with flexible time windows in which they can be scheduled. While previous works have treated these two problems in isolation, we combine them and provide, for the first time in the literature, a unified Markov decision process (MDP) framework for these problems. We then apply the Q-learning algorithm, a popular model-free reinforcement learning technique, to obtain the optimal policy. Through simulations, we show that our model outperforms the traditional power sharing models.
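For readers unfamiliar with Q-learning, a minimal tabular sketch follows. It is not the paper's microgrid model: the five-state chain environment `chain_step`, the hyperparameters, and the random tie-breaking are all hypothetical choices made only to show the update rule.

```python
import random


def q_learning(n_states, n_actions, step, episodes=500, alpha=0.1,
               gamma=0.95, epsilon=0.1, max_steps=100, seed=0):
    """Tabular Q-learning with an epsilon-greedy behaviour policy.

    step(state, action) must return (next_state, reward, done).
    Every episode starts in state 0 (an assumption of this sketch).
    """
    rng = random.Random(seed)
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s = 0
        for _ in range(max_steps):
            if rng.random() < epsilon:
                a = rng.randrange(n_actions)  # explore
            else:
                best = max(q[s])  # exploit, breaking ties at random
                a = rng.choice([i for i in range(n_actions) if q[s][i] == best])
            s2, r, done = step(s, a)
            # Bootstrap from the greedy value of the next state (off-policy target).
            target = r if done else r + gamma * max(q[s2])
            q[s][a] += alpha * (target - q[s][a])
            s = s2
            if done:
                break
    return q


def chain_step(s, a, n=5):
    """Hypothetical toy MDP: action 1 moves right, action 0 moves left.

    Reaching the last state gives reward 1 and ends the episode.
    """
    s2 = min(s + 1, n - 1) if a == 1 else max(s - 1, 0)
    done = s2 == n - 1
    return s2, (1.0 if done else 0.0), done
```

Calling `q_learning(5, 2, chain_step)` returns a table in which moving right has the higher value in every non-terminal state. Applying the same loop to a supply and demand problem would require encoding its states, actions, and rewards in a `step` function.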
Multi-Advisor Reinforcement Learning
Laroche, Romain, Fatemi, Mehdi, Romoff, Joshua, van Seijen, Harm
We consider tackling a single-agent RL problem by distributing it to $n$ learners. These learners, called advisors, each endeavour to solve the problem from a different focus. Their advice, taking the form of action values, is then communicated to an aggregator, which is in control of the system. We show that the local planning method for the advisors is critical and that none of the ones found in the literature is flawless: the egocentric planning overestimates values of states where the other advisors disagree, and the agnostic planning is inefficient around danger zones. We introduce a novel approach called empathic and discuss its theoretical aspects. We empirically examine and validate our theoretical findings on a fruit collection task.
InfoGAIL: Interpretable Imitation Learning from Visual Demonstrations
Li, Yunzhu, Song, Jiaming, Ermon, Stefano
The goal of imitation learning is to mimic expert behavior without access to an explicit reward signal. Expert demonstrations provided by humans, however, often show significant variability due to latent factors that are typically not explicitly modeled. In this paper, we propose a new algorithm that can infer the latent structure of expert demonstrations in an unsupervised way. Our method, built on top of Generative Adversarial Imitation Learning, can not only imitate complex behaviors, but also learn interpretable and meaningful representations of complex behavioral data, including visual demonstrations. In the driving domain, we show that a model learned from human demonstrations is able to both accurately reproduce a variety of behaviors and accurately anticipate human actions using raw visual inputs. Compared with various baselines, our method can better capture the latent structure underlying expert demonstrations, often recovering semantically meaningful factors of variation in the data.
Reinforcement Learning Algorithm Selection
Laroche, Romain, Feraud, Raphael
The setup is as follows: given an episodic task and a finite number of off-policy RL algorithms, a meta-algorithm has to decide which RL algorithm is in control during the next episode so as to maximize the expected return. The article presents a novel meta-algorithm, called Epochal Stochastic Bandit Algorithm Selection (ESBAS). Its principle is to freeze the policy updates at each epoch, and to leave a rebooted stochastic bandit in charge of the algorithm selection. Under some assumptions, a thorough theoretical analysis demonstrates its near-optimality considering the structural sampling budget limitations. ESBAS is first empirically evaluated on a dialogue task where it is shown to outperform each individual algorithm in most configurations. ESBAS is then adapted to a true online setting where algorithms update their policies after each transition, which we call SSBAS. SSBAS is evaluated on a fruit collection task where it is shown to adapt the stepsize parameter more efficiently than the classical hyperbolic decay, and on an Atari game, where it improves the performance by a wide margin.
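The epoch structure described in the abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the doubling epoch length, the use of UCB1 as the stochastic bandit, returns scaled to [0, 1], and the `fit` and `run_episode` interfaces are all choices made for this example.

```python
import math


class UCB1:
    """Standard UCB1 bandit; assumes rewards lie in [0, 1]."""

    def __init__(self, n_arms):
        self.counts = [0] * n_arms
        self.sums = [0.0] * n_arms

    def select(self):
        # Play every arm once before using the confidence bound.
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm
        t = sum(self.counts)
        return max(
            range(len(self.counts)),
            key=lambda a: self.sums[a] / self.counts[a]
            + math.sqrt(2 * math.log(t) / self.counts[a]),
        )

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.sums[arm] += reward


def esbas_sketch(algorithms, run_episode, n_epochs):
    """Epoch-wise algorithm selection.

    algorithms: objects with a hypothetical fit(trajectories) -> policy method.
    run_episode(policy) -> (trajectory, episode_return), also hypothetical.
    Returns the index of the algorithm chosen for each episode.
    """
    trajectories = []
    choices = []
    for beta in range(n_epochs):
        # Policies are recomputed here and stay frozen for the whole epoch.
        policies = [alg.fit(trajectories) for alg in algorithms]
        bandit = UCB1(len(algorithms))  # the bandit is rebooted each epoch
        for _ in range(2 ** beta):  # assumed epoch length
            k = bandit.select()
            trajectory, episode_return = run_episode(policies[k])
            trajectories.append(trajectory)
            bandit.update(k, episode_return)
            choices.append(k)
    return choices
```

Because every algorithm is assumed to be off-policy, each one can learn from all trajectories collected so far, regardless of which algorithm generated them.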
REINFORCEjs: Gridworld with Dynamic Programming
Temporal Difference Learning Gridworld Demo

This is a toy environment called **Gridworld** that is often used as a toy model in the Reinforcement Learning literature. In this particular case:

- **State space**: GridWorld has 10x10 = 100 distinct states. The start state is the top left cell. The gray cells are walls and cannot be moved to.
- **Environment Dynamics**: GridWorld is deterministic, leading to the same new state given each state and action.
- **Rewards**: The agent receives 1 reward when it is in the center square (the one that shows R 1.0), and -1 reward in a few states (R -1.0 is shown for these).
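Since the heading of this entry mentions dynamic programming, a small value-iteration sketch for a grid like the one described above may help. It is written in Python and is not REINFORCEjs code. The wall and penalty positions passed in, the discount factor, the rule that blocked moves leave the agent in place, paying the reward on entering a cell, and treating the +1 cell as terminal are all assumptions of this example.

```python
def value_iteration(n, walls, rewards, terminal, gamma=0.9, tol=1e-8):
    """Value iteration on a deterministic n x n grid with four moves.

    walls: set of (row, col) cells that cannot be entered.
    rewards: dict mapping (row, col) to the reward paid on entering it.
    terminal: set of cells whose value is fixed at 0.
    Returns a dict mapping each free cell to its optimal value.
    """
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    v = {(r, c): 0.0 for r in range(n) for c in range(n)
         if (r, c) not in walls}

    def nxt(s, m):
        t = (s[0] + m[0], s[1] + m[1])
        return t if t in v else s  # walls and grid edges block the move

    while True:
        delta = 0.0
        for s in v:
            if s in terminal:
                continue
            # Bellman optimality backup over the four deterministic moves.
            best = max(rewards.get(nxt(s, m), 0.0) + gamma * v[nxt(s, m)]
                       for m in moves)
            delta = max(delta, abs(best - v[s]))
            v[s] = best
        if delta < tol:
            return v
```

With, for example, `walls={(2, 3), (3, 3)}`, `rewards={(5, 5): 1.0, (4, 5): -1.0}` and `terminal={(5, 5)}`, the values decay geometrically with distance from the goal cell, and a greedy policy over them steers around the penalty cell.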