 Reinforcement Learning


Counterfactual Explanation with Multi-Agent Reinforcement Learning for Drug Target Prediction

arXiv.org Artificial Intelligence

Motivation: Several accurate deep learning models have been proposed to predict drug-target affinity (DTA). However, all of these models are black boxes, so their results are difficult to interpret and verify, which puts their acceptance at risk. Explanation is necessary to make DTA models more trustworthy. Counterfactual explanations provide human-understandable examples. Most counterfactual explanation methods operate on only a single input, in tabular or continuous form. In contrast, a DTA model has two discrete inputs, and it is challenging for a counterfactual generation framework to optimize both discrete inputs at the same time. Results: We propose a multi-agent reinforcement learning framework, Multi-Agent Counterfactual Drug-target binding Affinity (MACDA), to generate counterfactual explanations for the drug-protein complex. Our proposed framework provides human-interpretable counterfactual instances while optimizing both the input drug and target for counterfactual generation at the same time. The results on the Davis dataset show the advantages of the proposed MACDA framework compared with previous works.


An Exponential Lower Bound for Linearly-Realizable MDPs with Constant Suboptimality Gap

arXiv.org Artificial Intelligence

A fundamental question in the theory of reinforcement learning is: suppose the optimal $Q$-function lies in the linear span of a given $d$-dimensional feature mapping, is sample-efficient reinforcement learning (RL) possible? The recent and remarkable result of Weisz et al. (2020) resolved this question in the negative, providing an exponential (in $d$) sample size lower bound, which holds even if the agent has access to a generative model of the environment. One may hope that this information theoretic barrier for RL can be circumvented by further supposing an even more favorable assumption: there exists a constant suboptimality gap between the optimal $Q$-value of the best action and that of the second-best action (for all states). The hope is that having a large suboptimality gap would permit easier identification of optimal actions themselves, thus making the problem tractable; indeed, provided the agent has access to a generative model, sample-efficient RL is in fact possible with the addition of this more favorable assumption. This work focuses on this question in the standard online reinforcement learning setting, where our main result resolves this question in the negative: our hardness result shows that an exponential sample complexity lower bound still holds even if a constant suboptimality gap is assumed in addition to having a linearly realizable optimal $Q$-function. Perhaps surprisingly, this implies an exponential separation between the online RL setting and the generative model setting. Complementing our negative hardness result, we give two positive results showing that provably sample-efficient RL is possible either under an additional low-variance assumption or under a novel hypercontractivity assumption (both implicitly place stronger conditions on the underlying dynamics model).
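
The two assumptions named in this abstract can be restated informally as follows. This is only a sketch: the symbols $\phi$, $\theta^\star$, $a^\star(s)$ and $\Delta$ are notation chosen here for illustration and are not necessarily the notation (or the exact finite-horizon formulation) used in the paper.

```latex
% Linear realizability: the optimal Q-function is linear in a known
% d-dimensional feature map phi (theta^star is unknown to the learner).
\[
  Q^\star(s,a) = \langle \theta^\star, \phi(s,a) \rangle ,
  \qquad \phi(s,a) \in \mathbb{R}^d ,\ \theta^\star \in \mathbb{R}^d .
\]
% Constant suboptimality gap: in every state, the best action beats the
% second-best action by at least a constant Delta.
\[
  Q^\star\bigl(s, a^\star(s)\bigr) - \max_{a \neq a^\star(s)} Q^\star(s,a) \;\ge\; \Delta > 0
  \quad \text{for all states } s ,
  \qquad a^\star(s) = \arg\max_{a} Q^\star(s,a) .
\]
```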


Drop-Bottleneck: Learning Discrete Compressed Representation for Noise-Robust Exploration

arXiv.org Artificial Intelligence

We propose a novel information bottleneck (IB) method named Drop-Bottleneck, which discretely drops features that are irrelevant to the target variable. Drop-Bottleneck not only enjoys a simple and tractable compression objective but also provides a deterministic compressed representation of the input variable, which is useful for inference tasks that require a consistent representation. Moreover, it can jointly learn a feature extractor and select features considering each feature dimension's relevance to the target task, which is unattainable by most neural network-based IB methods. We propose an exploration method based on Drop-Bottleneck for reinforcement learning tasks. In a multitude of noisy and sparse-reward maze navigation tasks in VizDoom (Kempka et al., 2016) and DM-Lab (Beattie et al., 2016), our exploration method achieves state-of-the-art performance. We also demonstrate that, as a new IB framework, Drop-Bottleneck outperforms Variational Information Bottleneck (VIB) (Alemi et al., 2017) in multiple aspects including adversarial robustness and dimensionality reduction. Data with noise or task-irrelevant information can easily harm the training of a model; for instance, the noisy-TV problem (Burda et al., 2019a) is one such well-known phenomenon in reinforcement learning. If observations from the environment are modified to contain a TV screen, which changes its channel randomly based on the agent's actions, the performance of curiosity-based exploration methods dramatically degrades (Burda et al., 2019a;b; Kim et al., 2019; Savinov et al., 2019). The information bottleneck (IB) theory (Tishby et al., 2000; Tishby & Zaslavsky, 2015) provides a framework for dealing with such task-irrelevant information, and has been actively adopted for exploration in reinforcement learning (Kim et al., 2019; Igl et al., 2019). For an input variable X and a target variable Y, the IB theory introduces another variable Z, which is a compressed representation of X.
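
To make the idea of discretely dropping features concrete, here is a minimal sketch of feature-wise dropping with per-dimension drop probabilities. It is an illustration only and not the authors' implementation. The function names, the 0.5 threshold for the deterministic variant, and the absence of any rescaling of kept features are all assumptions made here. In the method itself the drop probabilities would be learned, whereas here they are simply given.

```python
import random


def stochastic_drop(features, drop_probs, rng):
    """Zero out each feature independently with its own drop probability.

    `drop_probs[i]` is the (assumed, hypothetical) probability that
    feature i is treated as irrelevant and dropped.
    """
    return [0.0 if rng.random() < p else x for x, p in zip(features, drop_probs)]


def deterministic_drop(features, drop_probs, threshold=0.5):
    """Deterministic variant: keep a feature only if its drop probability
    is below `threshold` (the 0.5 default is an assumption), so the same
    input always maps to the same compressed representation."""
    return [x if p < threshold else 0.0 for x, p in zip(features, drop_probs)]


features = [1.0, 2.0, 3.0]
drop_probs = [0.1, 0.9, 0.4]  # illustrative values, not learned

noisy_z = stochastic_drop(features, drop_probs, random.Random(0))
stable_z = deterministic_drop(features, drop_probs)
```

The deterministic variant is the one that matters for inference tasks that need a consistent representation: repeated calls on the same input return the same output, which the stochastic variant does not guarantee.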


Unsupervised Contextual Paraphrase Generation using Lexical Control and Reinforcement Learning

arXiv.org Artificial Intelligence

Customer support via chat requires agents to resolve customer queries with minimum wait time and maximum customer satisfaction. Given that the agents as well as the customers can have varying levels of literacy, the overall quality of responses provided by the agents tends to be poor if they are not predefined. But using only static responses can lead to customer detraction, as customers tend to feel that they are no longer interacting with a human. Hence, it is vital to have variations of the static responses to reduce the monotony of the responses. However, maintaining a list of such variations can be expensive. Given the conversation context and the agent response, we propose an unsupervised framework to generate contextual paraphrases using autoregressive models. We also propose an automated metric based on Semantic Similarity, Textual Entailment, Expression Diversity and Fluency to evaluate the quality of contextual paraphrases and demonstrate performance improvement with Reinforcement Learning (RL) fine-tuning using the automated metric as the reward function.


Policy Information Capacity: Information-Theoretic Measure for Task Complexity in Deep Reinforcement Learning

arXiv.org Artificial Intelligence

However, analyzing the nature of those environments is often overlooked. In particular, we still do not have agreeable ways to measure the difficulty or solvability of a task, given that each has fundamentally different actions, observations, dynamics, rewards, and can be tackled with diverse RL algorithms. In this work, we propose policy information capacity (PIC) - the mutual information between policy parameters and episodic return - and policy-optimal information capacity (POIC) - between policy parameters and episodic optimality - as two environment-agnostic, algorithm-agnostic quantitative metrics for task difficulty. Evaluating our metrics across toy environments as well as continuous control benchmark tasks from OpenAI Gym and DeepMind Control Suite, we empirically demonstrate that these information-theoretic metrics have higher correlations with normalized task solvability scores than a variety of alternatives. While in the past much of the empirical RL research has focused on tabular or linear function approximation case (Dietterich, 1998; McGovern & Barto, 2001; Konidaris & Barto, 2009), the impressive successes of recent years (and anticipation of domains ripe for subsequent successes) has spurred the creation of non-tabular benchmarks - i.e., continuous control and/or continuous observation - in which neural network function approximators are effectively a prerequisite (Bellemare et al., 2013; Brockman et al., 2016; Tassa et al., 2018). Accordingly, empirical RL research is presently heavily focused on the use of neural network function approximators, spurring new algorithmic developments in both model-free (Mnih et al., 2015; Schulman et al., 2015; Lillicrap et al., 2016; Gu et al., 2016b; 2017; Haarnoja et al., 2018) and model-based (Chua et al., 2018; Janner et al., 2019; Hafner et al., 2020a) RL. Despite the impressive progress of RL algorithms, the analysis of the RL environments has been difficult and stagnant, precisely due to the complexity of modern benchmarks and ...
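
Since PIC is defined as the mutual information between policy parameters and episodic return, it can be written as I(Θ; R) = H(R) - E_θ[H(R | θ)]. The sketch below is one simple way to estimate that quantity from sampled returns, assuming a plug-in histogram estimator. The function names, the binning scheme and the number of bins are choices made here for illustration and may differ from the estimator used in the paper.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (in nats) of an empirical distribution given as counts."""
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values() if c > 0)


def policy_information_capacity(returns_per_policy, num_bins=10):
    """Plug-in estimate of I(Theta; R) = H(R) - E_theta[H(R | theta)].

    `returns_per_policy[i]` holds episodic returns collected with the i-th
    sampled policy parameter. Returns are discretized into `num_bins`
    equal-width bins (an assumption of this sketch).
    """
    all_returns = [r for returns in returns_per_policy for r in returns]
    lo, hi = min(all_returns), max(all_returns)
    width = (hi - lo) / num_bins or 1.0  # avoid zero width if all returns are equal

    def bin_index(r):
        return min(int((r - lo) / width), num_bins - 1)

    marginal = entropy(Counter(bin_index(r) for r in all_returns))
    n_total = len(all_returns)
    conditional = sum(
        len(returns) / n_total * entropy(Counter(bin_index(r) for r in returns))
        for returns in returns_per_policy
    )
    return marginal - conditional


# Returns fully determined by the policy: high information capacity.
informative = policy_information_capacity([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
# Returns identically distributed under every policy: zero information capacity.
uninformative = policy_information_capacity([[0.0, 1.0], [0.0, 1.0]])
```

Intuitively, a larger value means that the choice of policy parameters changes the return distribution more, which is the sense in which the abstract uses these quantities as task-difficulty metrics.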


Spatial Intention Maps for Multi-Agent Mobile Manipulation

arXiv.org Artificial Intelligence

The ability to communicate intention enables decentralized multi-agent robots to collaborate while performing physical tasks. In this work, we present spatial intention maps, a new intention representation for multi-agent vision-based deep reinforcement learning that improves coordination between decentralized mobile manipulators. In this representation, each agent's intention is provided to other agents, and rendered into an overhead 2D map aligned with visual observations. This synergizes with the recently proposed spatial action maps framework, in which state and action representations are spatially aligned, providing inductive biases that encourage emergent cooperative behaviors requiring spatial coordination, such as passing objects to each other or avoiding collisions. Experiments across a variety of multi-agent environments, including heterogeneous robot teams with different abilities (lifting, pushing, or throwing), show that incorporating spatial intention maps improves performance for different mobile manipulation tasks while significantly enhancing cooperative behaviors.
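
As a rough illustration of rendering intentions into an overhead 2D map, the following sketch rasterizes each agent's intended path onto a grid that could be stacked with an overhead observation of the same size. Everything here is hypothetical: the function name, the grid-cell path encoding and the linear ramp along the path are assumptions made for illustration, not the representation used in the paper.

```python
def render_intention_map(height, width, intended_paths):
    """Rasterize intended paths into a single overhead intention channel.

    `intended_paths` is a list with one entry per other agent; each entry is
    a list of (row, col) grid cells from that agent's current position to its
    intended target. Cell values ramp linearly from near 0 at the start of a
    path to 1.0 at its target (an assumed encoding).
    """
    grid = [[0.0] * width for _ in range(height)]
    for path in intended_paths:
        n = len(path)
        for i, (row, col) in enumerate(path):
            # Skip cells that fall outside the map.
            if 0 <= row < height and 0 <= col < width:
                value = (i + 1) / n
                # Where paths overlap, keep the strongest intention signal.
                grid[row][col] = max(grid[row][col], value)
    return grid


# Two agents: one heading right along the top row, one heading down a column.
intention_map = render_intention_map(
    4, 4,
    [[(0, 0), (0, 1), (0, 2), (0, 3)], [(1, 2), (2, 2), (3, 2)]],
)
```

Because the map has the same spatial layout as the overhead observation, it can be supplied as an extra input channel, which matches the abstract's point that the intention representation is aligned with the visual observations.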


Assured Learning-enabled Autonomy: A Metacognitive Reinforcement Learning Framework

arXiv.org Artificial Intelligence

Reinforcement learning (RL) agents with pre-specified reward functions cannot provide guaranteed safety across the variety of circumstances that an uncertain system might encounter. To guarantee performance while assuring satisfaction of safety constraints across this variety of circumstances, this paper presents an assured autonomous control framework that empowers RL algorithms with metacognitive learning capabilities. More specifically, the reward function parameters of the RL agent are adapted in a metacognitive decision-making layer to assure the feasibility of the RL agent, that is, to assure that the policy learned by the RL agent satisfies safety constraints specified by signal temporal logic while achieving as much performance as possible. The metacognitive layer monitors any possible future safety violation under the actions of the RL agent and employs a higher-layer Bayesian RL algorithm to proactively adapt the reward function for the lower-layer RL agent. To minimize the higher-layer Bayesian RL intervention, the metacognitive layer uses a fitness function as a metric to evaluate the success of the lower-layer RL agent in satisfying safety and liveness specifications, and the higher-layer Bayesian RL intervenes only if there is a risk of lower-layer RL failure. Finally, a simulation example is provided to validate the effectiveness of the proposed approach.


Robust Multi-Modal Policies for Industrial Assembly via Reinforcement Learning and Demonstrations: A Large-Scale Study

arXiv.org Artificial Intelligence

Over the past several years there has been considerable research investment in learning-based approaches to industrial assembly, but despite significant progress these techniques have yet to be adopted by industry. We argue that it is the prohibitively large design space for Deep Reinforcement Learning (DRL), rather than algorithmic limitations per se, that is truly responsible for this lack of adoption. Pushing these techniques into the industrial mainstream requires an industry-oriented paradigm which differs significantly from the academic mindset. In this paper we define criteria for industry-oriented DRL, and perform a thorough comparison according to these criteria of one family of learning approaches, DRL from demonstration, against a professional industrial integrator on the recently established NIST assembly benchmark. We explain the design choices, representing several years of investigation, which enabled our DRL system to consistently outperform the integrator baseline in terms of both speed and reliability. Finally, we conclude with a competition between our DRL system and a human on a challenge task of insertion into a randomly moving target. This study suggests that DRL is capable of outperforming not only established engineered approaches, but the human motor system as well, and that there remains significant room for improvement. Videos can be found on our project website: https://sites.google.com/view/shield-nist.


Improving Actor-Critic Reinforcement Learning via Hamiltonian Policy

arXiv.org Machine Learning

Approximating optimal policies in reinforcement learning (RL), termed policy optimization, is often necessary in real-world scenarios. By viewing reinforcement learning from the perspective of variational inference (VI), the policy network is trained to obtain the approximate posterior of actions given the optimality criteria. However, in practice, policy optimization may lead to suboptimal policy estimates due to the amortization gap and insufficient exploration. In this work, inspired by the previous use of Hamiltonian Monte Carlo (HMC) in VI, we propose to integrate policy optimization with HMC. As such, we evolve actions drawn from the base policy according to HMC. First, HMC can improve the policy distribution to better approximate the posterior and hence reduce the amortization gap. Second, HMC can also guide exploration toward regions with higher action values, enhancing exploration efficiency. Instead of directly applying HMC to RL, we propose a new leapfrog operator to simulate the Hamiltonian dynamics. With comprehensive empirical experiments on continuous control baselines, including MuJoCo, PyBullet Roboschool and DeepMind Control Suite, we show that the proposed approach is a data-efficient and easy-to-implement improvement over previous policy optimization methods. Besides, the proposed approach can also outperform previous methods on DeepMind Control Suite, which has a high-dimensional, image-based observation space.
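
For readers unfamiliar with how HMC simulates Hamiltonian dynamics, the sketch below shows the standard leapfrog integrator. This is the textbook operator, not the new leapfrog operator proposed in the paper. The function name and the quadratic potential are illustrative assumptions. In an RL setting one could imagine the potential being a negative action value, so that trajectories drift toward higher-value actions, but that mapping is also an assumption here.

```python
def leapfrog(position, momentum, grad_potential, step_size, num_steps):
    """Standard leapfrog integration of Hamiltonian dynamics.

    H(q, p) = U(q) + 0.5 * |p|^2, where `grad_potential(q)` returns dU/dq
    as a list. Returns the final (position, momentum).
    """
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    q, p = list(position), list(momentum)

    # Initial half step for the momentum.
    grad = grad_potential(q)
    p = [pi - 0.5 * step_size * gi for pi, gi in zip(p, grad)]

    for step in range(num_steps):
        # Full step for the position.
        q = [qi + step_size * pi for qi, pi in zip(q, p)]
        grad = grad_potential(q)
        # Full momentum step, except a half step at the very end.
        scale = 1.0 if step < num_steps - 1 else 0.5
        p = [pi - scale * step_size * gi for pi, gi in zip(p, grad)]
    return q, p


def quadratic_grad(q):
    """Gradient of the toy potential U(q) = 0.5 * |q|^2."""
    return list(q)


def hamiltonian(q, p):
    """Total energy for the toy quadratic potential."""
    return 0.5 * sum(x * x for x in q) + 0.5 * sum(x * x for x in p)


q0, p0 = [1.0], [0.0]
q1, p1 = leapfrog(q0, p0, quadratic_grad, step_size=0.1, num_steps=10)
```

Leapfrog is used in HMC because it is time-reversible and approximately conserves the Hamiltonian, so proposals produced this way tend to be accepted with high probability.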


Combining Reward Information from Multiple Sources

arXiv.org Artificial Intelligence

Given two sources of evidence about a latent variable, one can combine the information from both by multiplying the likelihoods of each piece of evidence. However, when one or both of the observation models are misspecified, the distributions will conflict. We study this problem in the setting with two conflicting reward functions learned from different sources. In such a setting, we would like to retreat to a broader distribution over reward functions, in order to mitigate the effects of misspecification. We assume that an agent will maximize expected reward given this distribution over reward functions, and identify four desiderata for this setting. We propose a novel algorithm, Multitask Inverse Reward Design (MIRD), and compare it to a range of simple baselines. While all methods must trade off between conservatism and informativeness, through a combination of theory and empirical results on a toy environment, we find that MIRD and its variant MIRD-IF strike a good balance between the two.
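
The opening observation, that two sources of evidence can be combined by multiplying their likelihoods, can be illustrated with a toy discrete example. The sketch below is not MIRD and does not implement any method from the paper. It only shows the naive product rule over a handful of candidate hypotheses, with made-up likelihood values, to make visible what happens when the two sources disagree.

```python
def combine(prior, likelihood_a, likelihood_b):
    """Posterior over a discrete latent variable given two conditionally
    independent evidence sources: p(h | a, b) is proportional to
    p(h) * p(a | h) * p(b | h)."""
    unnormalized = [p * la * lb for p, la, lb in zip(prior, likelihood_a, likelihood_b)]
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("the two sources rule out every hypothesis")
    return [u / total for u in unnormalized]


# Three candidate reward hypotheses with a uniform prior (illustrative values).
prior = [1 / 3, 1 / 3, 1 / 3]
source_a = [0.7, 0.2, 0.1]            # source A favors hypothesis 0
agreeing_b = [0.6, 0.3, 0.1]          # source B agrees with A
conflicting_b = [0.05, 0.05, 0.9]     # source B strongly favors hypothesis 2

posterior_agree = combine(prior, source_a, agreeing_b)
posterior_conflict = combine(prior, source_a, conflicting_b)
```

When the sources agree, the product sharpens the posterior around the shared favorite. When they conflict, the product still commits to a narrow answer determined by whichever likelihood is more extreme, which is one way to see why a broader distribution over reward functions can be preferable under misspecification.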