Goto

Collaborating Authors

 Reinforcement Learning


Policy Entropy for Out-of-Distribution Classification

arXiv.org Artificial Intelligence

One critical prerequisite for the deployment of reinforcement learning systems in the real world is the ability to reliably detect situations on which the agent was not trained. Such situations could lead to potential safety risks when wrong predictions lead to the execution of harmful actions. In this work, we propose PEOC, a new policy entropy based out-of-distribution classifier that reliably detects unencountered states in deep reinforcement learning. It is based on using the entropy of an agent's policy as the classification score of a one-class classifier. We evaluate our approach using a procedural environment generator. Results show that PEOC is highly competitive against state-of-the-art one-class classification algorithms on the evaluated environments. Furthermore, we present a structured process for benchmarking out-of-distribution classification in reinforcement learning.


Dynamic Value Estimation for Single-Task Multi-Scene Reinforcement Learning

arXiv.org Artificial Intelligence

Training deep reinforcement learning agents on environments with multiple levels / scenes / conditions from the same task, has become essential for many applications aiming to achieve generalization and domain transfer from simulation to the real world [1,2]. While such a strategy is helpful with generalization, the use of multiple scenes significantly increases the variance of samples collected for policy gradient computations. Current methods continue to view this collection of scenes as a single Markov Decision Process (MDP) with a common value function; however, we argue that it is better to treat the collection as a single environment with multiple underlying MDPs. To this end, we propose a dynamic value estimation (DVE) technique for these multiple-MDP environments, motivated by the clustering effect observed in the value function distribution across different scenes. The resulting agent is able to learn a more accurate and scene-specific value function estimate (and hence the advantage function), leading to a lower sample variance. Our proposed approach is simple to accommodate with several existing implementations (like PPO, A3C) and results in consistent improvements for a range of ProcGen environments and the AI2-THOR framework based visual navigation task.


Should artificial agents ask for help in human-robot collaborative problem-solving?

arXiv.org Artificial Intelligence

Transferring as fast as possible the functioning of our brain to artificial intelligence is an ambitious goal that would help advance the state of the art in AI and robotics. It is in this perspective that we propose to start from hypotheses derived from an empirical study in a human-robot interaction and to verify if they are validated in the same way for children as for a basic reinforcement learning algorithm. Thus, we check whether receiving help from an expert when solving a simple close-ended task (the Towers of Hano\"i) allows to accelerate or not the learning of this task, depending on whether the intervention is canonical or requested by the player. Our experiences have allowed us to conclude that, whether requested or not, a Q-learning algorithm benefits in the same way from expert help as children do.


Implementation Matters in Deep Policy Gradients: A Case Study on PPO and TRPO

arXiv.org Machine Learning

We study the roots of algorithmic progress in deep policy gradient algorithms through a case study on two popular algorithms: Proximal Policy Optimization (PPO) and Trust Region Policy Optimization (TRPO). Specifically, we investigate the consequences of "code-level optimizations:" algorithm augmentations found only in implementations or described as auxiliary details to the core algorithm. Seemingly of secondary importance, such optimizations turn out to have a major impact on agent behavior. Our results show that they (a) are responsible for most of PPO's gain in cumulative reward over TRPO, and (b) fundamentally change how RL methods function. These insights show the difficulty and importance of attributing performance gains in deep reinforcement learning. Code for reproducing our results is available at https://github.com/MadryLab/implementation-matters .


Meta-Reinforcement Learning for Trajectory Design in Wireless UAV Networks

arXiv.org Machine Learning

In this paper, the design of an optimal trajectory for an energy-constrained drone operating in dynamic network environments is studied. In the considered model, a drone base station (DBS) is dispatched to provide uplink connectivity to ground users whose demand is dynamic and unpredictable. In this case, the DBS's trajectory must be adaptively adjusted to satisfy the dynamic user access requests. To this end, a meta-learning algorithm is proposed in order to adapt the DBS's trajectory when it encounters novel environments, by tuning a reinforcement learning (RL) solution. The meta-learning algorithm provides a solution that adapts the DBS in novel environments quickly based on limited former experiences. The meta-tuned RL is shown to yield a faster convergence to the optimal coverage in unseen environments with a considerably low computation complexity, compared to the baseline policy gradient algorithm. Simulation results show that, the proposed meta-learning solution yields a 25% improvement in the convergence speed, and about 10% improvement in the DBS' communication performance, compared to a baseline policy gradient algorithm. Meanwhile, the probability that the DBS serves over 50% of user requests increases about 27%, compared to the baseline policy gradient algorithm.


Generator and Critic: A Deep Reinforcement Learning Approach for Slate Re-ranking in E-commerce

arXiv.org Machine Learning

The slate re-ranking problem considers the mutual influences between items to improve user satisfaction in e-commerce, compared with the point-wise ranking. Previous works either directly rank items by an end to end model, or rank items by a score function that trades-off the point-wise score and the diversity between items. However, there are two main existing challenges that are not well studied: (1) the evaluation of the slate is hard due to the complex mutual influences between items of one slate; (2) even given the optimal evaluation, searching the optimal slate is challenging as the action space is exponentially large. In this paper, we present a novel Generator and Critic slate re-ranking approach, where the Critic evaluates the slate and the Generator ranks the items by the reinforcement learning approach. We propose a Full Slate Critic (FSC) model that considers the real impressed items and avoids the "impressed bias" of existing models. For the Generator, to tackle the Figure 1: The return list when searching "smart watch" problem of large action space, we propose a new exploration reinforcement learning algorithm, called PPO-Exploration. Experimental results show that the FSC model significantly outperforms the state of the art slate evaluation methods, and the PPO-Exploration To improve the diversity of the list, a series of research works, algorithm outperforms the existing reinforcement learning methods MMR, IA-Select, xQuAD and DUM [1, 4, 9, 24] are proposed to rank substantially. The Generator and Critic approach improves both items by weighted functions that tradeoff the user-item scores and the slate efficiency(4% gmv and 5% number of orders) and diversity the diversities of items. However, these methods ignore the impact in live experiments on one of the largest e-commerce websites in of diversity on the efficiency of the list.


Gradient Monitored Reinforcement Learning

arXiv.org Machine Learning

This paper presents a novel neural network training approach for faster convergence and better generalization abilities in deep reinforcement learning. Particularly, we focus on the enhancement of training and evaluation performance in reinforcement learning algorithms by systematically reducing gradient's variance and thereby providing a more targeted learning process. The proposed method which we term as Gradient Monitoring(GM), is an approach to steer the learning in the weight parameters of a neural network based on the dynamic development and feedback from the training process itself. We propose different variants of the GM methodology which have been proven to increase the underlying performance of the model. The one of the proposed variant, Momentum with Gradient Monitoring (M-WGM), allows for a continuous adjustment of the quantum of back-propagated gradients in the network based on certain learning parameters. We further enhance the method with Adaptive Momentum with Gradient Monitoring (AM-WGM) method which allows for automatic adjustment between focused learning of certain weights versus a more dispersed learning depending on the feedback from the rewards collected. As a by-product, it also allows for automatic derivation of the required deep network sizes during training as the algorithm automatically freezes trained weights. The approach is applied to two discrete (Multi-Robot Co-ordination problem and Atari games) and one continuous control task (MuJoCo) using Advantage Actor-Critic (A2C) and Proximal Policy Optimization (PPO) respectively. The results obtained particularly underline the applicability and performance improvements of the methods in terms of generalization capability.


Learning visual servo policies via planner cloning

arXiv.org Artificial Intelligence

This algorithm differs from Visual servoing in novel environments is an important AGGREVATE because problem. Given images produced by a camera, a visual servo it incorporates the value control policy guides a grasped part into a desired pose penalties and from DQfD relative to the environment. This problem appears in many because it uses supervised situations: reaching, grasping, peg insertion, stacking, machine targets rather than TD assembly tasks, etc. Whereas classical approaches to the targets. We compare PQC problem [6, 3, 27] typically make strong assumptions about the with several baselines and environment (fiducials, known object geometries, etc.), there algorithm ablations and has been a surge of interest recently in using deep learning show that it outperforms methods to solve these problems in more unstructured settings all these variations on two that incorporate novel objects [29, 14, 26, 8, 21, 28, 12, 13].


Automatic Discovery of Interpretable Planning Strategies

arXiv.org Artificial Intelligence

When making decisions, people often overlook critical information or are overly swayed by irrelevant information. A common approach to mitigate these biases is to provide decision-makers, especially professionals such as medical doctors, with decision aids, such as decision trees and flowcharts. Designing effective decision aids is a difficult problem. We propose that recently developed reinforcement learning methods for discovering clever heuristics for good decision-making can be partially leveraged to assist human experts in this design process. One of the biggest remaining obstacles to leveraging the aforementioned methods for improving human decision-making is that the policies they learn are opaque to people. To solve this problem, we introduce AI-Interpret: a general method for transforming idiosyncratic policies into simple and interpretable descriptions. Our algorithm combines recent advances in imitation learning and program induction with a new clustering method for identifying a large subset of demonstrations that can be accurately described by a simple, high-performing decision rule. We evaluate our new AI-Interpret algorithm and employ it to translate information-acquisition policies discovered through metalevel reinforcement learning. The results of three large behavioral experiments showed that the provision of decision rules as flowcharts significantly improved people's planning strategies and decisions across three different classes of sequential decision problems. Furthermore, a series of ablation studies confirmed that our AI-Interpret algorithm was critical to the discovery of interpretable decision rules and that it is ready to be applied to other reinforcement learning problems. We conclude that the methods and findings presented in this article are an important step towards leveraging automatic strategy discovery to improve human decision-making.


Reinforcement Learning with Iterative Reasoning for Merging in Dense Traffic

arXiv.org Artificial Intelligence

To avoid the computational requirements of online methods, we can use reinforcement learning (RL) instead. In RL, In recent years, major progress has been made to deploy the agent interacts with a simulation environment many autonomous vehicles and improve safety. However, certain times prior to execution, and at each simulation episode common driving situations like merging in dense traffic are it improves its strategy. The resulting policy can then be still challenging for autonomous vehicles. Situations like deployed online and is often inexpensive to evaluate. RL the one illustrated in Figure 1 often involve negotiating with provides a flexible framework to automatically find good human drivers.