Goto

Collaborating Authors

 Reinforcement Learning


Adaptive Stress Testing without Domain Heuristics using Go-Explore

arXiv.org Machine Learning

Recently, reinforcement learning (RL) has been used as a tool for finding failures in autonomous systems. During execution, the RL agents often rely on some domain-specific heuristic reward to guide them towards finding failures, but constructing such a heuristic may be difficult or infeasible. Without a heuristic, the agent may only receive rewards at the time of failure, or even rewards that guide it away from failures. For example, some approaches give rewards for taking more-likely actions, because we want to find more-likely failures. However, the agent may then learn to only take likely actions, and may not be able to find a failure at all. Consequently, the problem becomes a hard-exploration problem, where rewards do not aid exploration. A new algorithm, go-explore (GE), has recently set new records on benchmarks from the hard-exploration field. We apply GE to adaptive stress testing (AST), one example of an RL-based falsification approach that provides a way to search for the most-likely failure scenario. We simulate a scenario where an autonomous vehicle drives while a pedestrian is crossing the road. We demonstrate that GE is able to find failures without domain-specific heuristics, such as the distance between the car and the pedestrian, on scenarios that other RL techniques are unable to solve. Furthermore, inspired by the robustification phase of GE, we demonstrate that the backwards algorithm (BA) improves the failures found by other RL techniques.


CURL: Contrastive Unsupervised Representations for Reinforcement Learning

arXiv.org Machine Learning

We present CURL: Contrastive Unsupervised Representations for Reinforcement Learning. CURL extracts high-level features from raw pixels using contrastive learning and performs off-policy control on top of the extracted features. CURL outperforms prior pixel-based methods, both model-based and model-free, on complex tasks in the DeepMind Control Suite and Atari Games showing 2.8x and 1.6x performance gains respectively at the 100K interaction steps benchmark. On the DeepMind Control Suite, CURL is the first image-based algorithm to nearly match the sample-efficiency and performance of methods that use state-based features.


Solving the scalarization issues of Advantage-based Reinforcement Learning Algorithms

arXiv.org Machine Learning

In this paper we investigate some of the issues that arise from the scalarization of the multi-objective optimization problem in the Advantage Actor Critic (A2C) reinforcement learning algorithm. We show how a naive scalarization leads to gradients overlapping and we also argue that the entropy regularization term just inject uncontrolled noise into the system. We propose two methods: one to avoid gradient overlapping (NOG) but keeping the same loss formulation; and one to avoid the noise injection (TE) but generating action distributions with a desired entropy. A comprehensive pilot experiment has been carried out showing how using our proposed methods speeds up the training of 210%. We argue how the proposed solutions can be applied to all the Advantage based reinforcement learning algorithms.


Resource Management for Blockchain-enabled Federated Learning: A Deep Reinforcement Learning Approach

arXiv.org Machine Learning

Blockchain-enabled Federated Learning (BFL) enables model updates of Federated Learning (FL) to be stored in the blockchain in a secure and reliable manner. However, the issue of BFL is that the training latency may increase due to the blockchain mining process. The other issue is that mobile devices in BFL have energy and CPU constraints that may reduce the system lifetime and training efficiency. To address these issues, the Machine Learning Model Owner (MLMO) needs to (i) decide how much data and energy that the mobile devices use for the training and (ii) determine the mining difficulty to minimize the training latency and energy consumption while achieving the target model accuracy. Under the uncertainty of the BFL environment, it is challenging for the MLMO to determine the optimal decisions. We propose to use the Deep Reinforcement Learning (DRL) to derive the optimal decisions for the MLMO.


RL for Planning and Planning for RL

#artificialintelligence

The figure above illustrates the method:(a) Goal-conditioned RL often fails to reach distant goals, but can successfully reach the goal if starting nearby (inside the green region). Reinforcement learning (RL) has seen a lot of progress over the past few years, tackling increasingly complex tasks. Much of this progress has been enabled by combining existing RL algorithms with powerful function approximators (i.e., neural networks). Function approximators have enabled researchers to apply RL to tasks with high-dimensional inputs without hand-crafting representations, distance metrics, or low-level controllers. However, function approximators have not come for free, and their cost is reflected in notoriously challenging optimization: deep RL algorithms are famously unstable and sensitive to hyperparameters.


State-Only Imitation Learning for Dexterous Manipulation

arXiv.org Machine Learning

Dexterous manipulation has been a long-standing challenge in robotics. Recently, modern model-free RL has demonstrated impressive results on a number of problems. However, complex domains like dexterous manipulation remain a challenge for RL due to the poor sample complexity. To address this, current approaches employ expert demonstrations in the form of state-action pairs, which are difficult to obtain for real-world settings such as learning from videos. In this work, we move toward a more realistic setting and explore state-only imitation learning. To tackle this setting, we train an inverse dynamics model and use it to predict actions for state-only demonstrations. The inverse dynamics model and the policy are trained jointly. Our method performs on par with state-action approaches and considerably outperforms RL alone. By not relying on expert actions, we are able to learn from demonstrations with different dynamics, morphologies, and objects.


Ultrasound-Guided Robotic Navigation with Deep Reinforcement Learning

arXiv.org Machine Learning

In this paper we introduce the first reinforcement learning (RL) based robotic navigation method which utilizes ultrasound (US) images as an input. Our approach combines state-of-the-art RL techniques, specifically deep Q-networks (DQN) with memory buffers and a binary classifier for deciding when to terminate the task. Our method is trained and evaluated on an in-house collected data-set of 34 volunteers and when compared to pure RL and supervised learning (SL) techniques, it performs substantially better, which highlights the suitability of RL navigation for US-guided procedures. When testing our proposed model, we obtained a 82.91% chance of navigating correctly to the sacrum from 165 different starting positions on 5 different unseen simulated environments.


Stochastic Approximation with Markov Noise: Analysis and applications in reinforcement learning

arXiv.org Machine Learning

We present for the first time an asymptotic convergence analysis of two time-scale stochastic approximation driven by "controlled" Markov noise. In particular, the faster and slower recursions have non-additive controlled Markov noise components in addition to martingale difference noise. We analyze the asymptotic behavior of our framework by relating it to limiting differential inclusions in both time scales that are defined in terms of the ergodic occupation measures associated with the controlled Markov processes. Using a special case of our results, we present a solution to the off-policy convergence problem for temporal-difference learning with linear function approximation. We compile several aspects of the dynamics of stochastic approximation algorithms with Markov iterate-dependent noise when the iterates are not known to be stable beforehand. We achieve the same by extending the lock-in probability (i.e. the probability of convergence to a specific attractor of the limiting o.d.e. given that the iterates are in its domain of attraction after a sufficiently large number of iterations (say) n_0) framework to such recursions. We use these results to prove almost sure convergence of the iterates to the specified attractor when the iterates satisfy an "asymptotic tightness" condition. This, in turn, is shown to be useful in analyzing the tracking ability of general "adaptive" algorithms. Finally, we obtain the first informative error bounds on function approximation for the policy evaluation algorithm proposed by Basu et al. when the aim is to find the risk-sensitive cost represented using exponential utility. We show that this happens due to the absence of difference term in the earlier bound which is always present in all our bounds when the state space is large.


Guided Dialog Policy Learning without Adversarial Learning in the Loop

arXiv.org Artificial Intelligence

Reinforcement-based training methods have emerged as the most popular choice to train an efficient and effective dialog policy. However, these methods are suffering from sparse and unstable reward signals usually returned from the user simulator at the end of the dialog. Besides, the reward signal is manually designed by human experts which requires domain knowledge. A number of adversarial learning methods have been proposed to learn the reward function together with the dialog policy. However, to alternatively update the dialog policy and the reward model on the fly, the algorithms to update the dialog policy are limited to policy gradient-based algorithms, such as REINFORCE and PPO. Besides, the alternative training of the dialog agent and the reward model can easily get stuck in local optimum or result in mode collapse. In this work, we propose to decompose the previous adversarial training into two different steps. We first train the discriminator with an auxiliary dialog generator and then incorporate this trained reward model to a common reinforcement learning method to train a high-quality dialog agent. This approach is applicable to both on-policy and off-policy reinforcement learning methods. By conducting several experiments, we show the proposed methods can achieve remarkable task success and its potential to transfer knowledge from existing domains to a new domain.


Online Constrained Model-based Reinforcement Learning

arXiv.org Machine Learning

Applying reinforcement learning to robotic systems poses a number of challenging problems. A key requirement is the ability to handle continuous state and action spaces while remaining within a limited time and resource budget. Additionally, for safe operation, the system must make robust decisions under hard constraints. To address these challenges, we propose a model based approach that combines Gaussian Process regression and Receding Horizon Control. Using sparse spectrum Gaussian Processes, we extend previous work by updating the dynamics model incrementally from a stream of sensory data. This results in an agent that can learn and plan in real-time under non-linear constraints. We test our approach on a cart pole swing-up environment and demonstrate the benefits of online learning on an autonomous racing task. The environment's dynamics are learned from limited training data and can be reused in new task instances without retraining.