Reinforcement Learning
Breaking the Curse of Many Agents: Provable Mean Embedding Q-Iteration for Mean-Field Reinforcement Learning
Wang, Lingxiao, Yang, Zhuoran, Wang, Zhaoran
Multi-agent reinforcement learning (MARL) achieves significant empirical successes. However, MARL suffers from the curse of many agents. In this paper, we exploit the symmetry of agents in MARL. In the most generic form, we study a mean-field MARL problem. Such a mean-field MARL is defined on mean-field states, which are distributions that are supported on continuous space. Based on the mean embedding of the distributions, we propose MF-FQI algorithm that solves the mean-field MARL and establishes a non-asymptotic analysis for MF-FQI algorithm. We highlight that MF-FQI algorithm enjoys a "blessing of many agents" property in the sense that a larger number of observed agents improves the performance of MF-FQI algorithm.
On Optimism in Model-Based Reinforcement Learning
Pacchiano, Aldo, Ball, Philip, Parker-Holder, Jack, Choromanski, Krzysztof, Roberts, Stephen
The principle of optimism in the face of uncertainty is prevalent throughout sequential decision making problems such as multi-armed bandits and reinforcement learning (RL), often coming with strong theoretical guarantees. However, it remains a challenge to scale these approaches to the deep RL paradigm, which has achieved a great deal of attention in recent years. In this paper, we introduce a tractable approach to optimism via noise augmented Markov Decision Processes (MDPs), which we show can obtain a competitive regret bound: $\tilde{\mathcal{O}}( |\mathcal{S}|H\sqrt{|\mathcal{S}||\mathcal{A}| T } )$ when augmenting using Gaussian noise, where $T$ is the total number of environment steps. This tractability allows us to apply our approach to the deep RL setting, where we rigorously evaluate the key factors for success of optimistic model-based RL algorithms, bridging the gap between theory and practice.
Hierarchical Reinforcement Learning for Deep Goal Reasoning: An Expressiveness Analysis
Yuan, Weihang, Muรฑoz-Avila, Hรฉctor
Hierarchical DQN (h-DQN) is a two-level architecture of feedforward neural networks where the meta level selects goals and the lower level takes actions to achieve the goals. We show tasks that cannot be solved by h-DQN, exemplifying the limitation of this type of hierarchical framework (HF). We describe the recurrent hierarchical framework (RHF), generalizing architectures that use a recurrent neural network at the meta level. We analyze the expressiveness of HF and RHF using context-sensitive grammars. We show that RHF is more expressive than HF. We perform experiments comparing an implementation of RHF with two HF baselines; the results corroborate our theoretical findings.
Accelerating Safe Reinforcement Learning with Constraint-mismatched Policies
Yang, Tsung-Yen, Rosca, Justinian, Narasimhan, Karthik, Ramadge, Peter J.
We consider the problem of reinforcement learning when provided with a baseline control policy and a set of constraints that the controlled system must satisfy. The baseline policy might arise from a heuristic, a prior application, a teacher or demonstrator data. The constraints might encode safety, fairness or some application-specific requirements. We want to efficiently use reinforcement learning to adapt the baseline policy to improve performance and satisfy the given constraints when it is applied to the new system. The key challenge is to effectively use the baseline policy (which need not satisfy the current constraints) to aid the learning of a constraint-satisfying policy in the new application. We propose an iterative algorithm for solving this problem. Each iteration is composed of three-steps. The first step performs a policy update to increase the expected reward, the second step performs a projection to minimize the distance between the current policy and the baseline policy, and the last step performs a projection onto the set of policies that satisfy the constraints. This procedure allows the learning process to leverage the baseline policy to achieve faster learning while improving reward performance and satisfying the constraints imposed on the current problem. We analyze the convergence of the proposed algorithm and provide a finite-sample guarantee. Empirical results demonstrate that the algorithm can achieve superior performance, with 10 times fewer constraint violations and around 40% higher reward compared to state-of-the-art methods.
Policy Evaluation and Seeking for Multi-Agent Reinforcement Learning via Best Response
Yan, Rui, Duan, Xiaoming, Shi, Zongying, Zhong, Yisheng, Marden, Jason R., Bullo, Francesco
This paper introduces two metrics (cycle-based and memory-based metrics), grounded on a dynamical game-theoretic solution concept called sink equilibrium, for the evaluation, ranking, and computation of policies in multi-agent learning. We adopt strict best response dynamics (SBRD) to model selfish behaviors at a meta-level for multi-agent reinforcement learning. Our approach can deal with dynamical cyclical behaviors (unlike approaches based on Nash equilibria and Elo ratings), and is more compatible with single-agent reinforcement learning than alpha-rank which relies on weakly better responses. We first consider settings where the difference between largest and second largest underlying metric has a known lower bound. With this knowledge we propose a class of perturbed SBRD with the following property: only policies with maximum metric are observed with nonzero probability for a broad class of stochastic games with finite memory. We then consider settings where the lower bound for the difference is unknown. For this setting, we propose a class of perturbed SBRD such that the metrics of the policies observed with nonzero probability differ from the optimal by any given tolerance. The proposed perturbed SBRD addresses the opponent-induced non-stationarity by fixing the strategies of others for the learning agent, and uses empirical game-theoretic analysis to estimate payoffs for each strategy profile obtained due to the perturbation.
Identifying Cognitive Radars -- Inverse Reinforcement Learning using Revealed Preferences
Krishnamurthy, Vikram, Angley, Daniel, Evans, Robin, Moran, William
We consider an inverse reinforcement learning problem involving us versus an enemy radar equipped with a Bayesian tracker. By observing the emissions of the enemy radar,how can we identify if the radar is cognitive (constrained utility maximizer)? Given the observed sequence of actions taken by the enemy's radar, we consider three problems: (i) Are the enemy radar's actions (waveform choice, beam scheduling) consistent with constrained utility maximization? If so how can we estimate the cognitive radar's utility function that is consistent with its actions. We formulate and solve the problem in terms of the spectra (eigenvalues) of the state and observation noise covariance matrices, and the algebraic Riccati equation. (ii) How to construct a statistical test for detecting a cognitive radar (constrained utility maximization) when we observe the radar's actions in noise or the radar observes our probe signal in noise? We propose a statistical detector with a tight Type-II error bound. (iii) How can we optimally probe (interrogate) the enemy's radar by choosing our state to minimize the Type-II error of detecting if the radar is deploying an economic rational strategy, subject to a constraint on the Type-I detection error? We present a stochastic optimization algorithm to optimize our probe signal. The main analysis framework used in this paper is that of revealed preferences from microeconomics.
Entropic Risk Constrained Soft-Robust Policy Optimization
Russel, Reazul Hasan, Behzadian, Bahram, Petrik, Marek
Having a perfect model to compute the optimal policy is often infeasible in reinforcement learning. It is important in high-stakes domains to quantify and manage risk induced by model uncertainties. Entropic risk measure is an exponential utility-based convex risk measure that satisfies many reasonable properties. In this paper, we propose an entropic risk constrained policy gradient and actor-critic algorithms that are risk-averse to the model uncertainty. We demonstrate the usefulness of our algorithms on several problem domains.
Langevin Dynamics for Inverse Reinforcement Learning of Stochastic Gradient Algorithms
Krishnamurthy, Vikram, Yin, George
Inverse reinforcement learning (IRL) aims to estimate the reward function of optimizing agents by observing their response (estimates or actions). This paper considers IRL when noisy estimates of the gradient of a reward function generated by multiple stochastic gradient agents are observed. We present a generalized Langevin dynamics algorithm to estimate the reward function $R(\theta)$; specifically, the resulting Langevin algorithm asymptotically generates samples from the distribution proportional to $\exp(R(\theta))$. The proposed IRL algorithms use kernel-based passive learning schemes. We also construct multi-kernel passive Langevin algorithms for IRL which are suitable for high dimensional data. The performance of the proposed IRL algorithms are illustrated on examples in adaptive Bayesian learning, logistic regression (high dimensional problem) and constrained Markov decision processes. We prove weak convergence of the proposed IRL algorithms using martingale averaging methods. We also analyze the tracking performance of the IRL algorithms in non-stationary environments where the utility function $R(\theta)$ jump changes over time as a slow Markov chain.
Counterfactually Guided Policy Transfer in Clinical Settings
Killian, Taylor W., Ghassemi, Marzyeh, Joshi, Shalmali
Reliably transferring treatment policies learned in one clinical environment to another is currently limited by challenges related to domain shift. In this paper we address off-policy learning for sequential decision making under domain shift -- a scenario susceptible to catastrophic overconfidence -- which is highly relevant to a high-stakes clinical settings where the target domain may also be data-scarce. We propose a two-fold counterfactual regularization procedure to improve off-policy learning, addressing domain shift and data scarcity. First, we utilize an informative prior derived from a data-rich source environment to indirectly improve drawing counterfactual example observations. Then, these samples are then used to learn a policy for the target domain, regularized by the source policy through KL-divergence. In simulated sepsis treatment, our counterfactual policy transfer procedure significantly improves the performance of a learned treatment policy.
r/MachineLearning - [R] Google AI Blog: Using Selective Attention in Reinforcement Learning Agents
Inattentional blindness is the psychological phenomenon that causes one to miss things in plain sight, and is a consequence of the selective attention that enables you to remain focused on important parts of the world without distraction from irrelevant details. It is believed that this selective attention mechanism enables people to condense broad sensory information into a form that is compact enough to be used for future decision making. While this may seem to be a limitation, such "bottlenecks" observed in nature can also inspire the design of machine learning systems that hope to mimic the success and efficiency of biological organisms. For example, while most methods presented in the deep reinforcement learning (RL) literature allow an agent to access the entire visual input, and even incorporating modules for predicting future sequences of visual inputs, perhaps reducing an agent's access to its visual inputs via an attention constraint could be beneficial to an agent's performance?