AITopics

2006.11917

Country:

North America > United States > California (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.40)

Industry: Leisure & Entertainment > Games (0.93)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents > Agent Societies (0.46)

Pacchiano, Aldo, Ball, Philip, Parker-Holder, Jack, Choromanski, Krzysztof, Roberts, Stephen

On Optimism in Model-Based Reinforcement Learning

arXiv.org Machine LearningJun-21-2020

The principle of optimism in the face of uncertainty is prevalent throughout sequential decision making problems such as multi-armed bandits and reinforcement learning (RL), often coming with strong theoretical guarantees. However, it remains a challenge to scale these approaches to the deep RL paradigm, which has achieved a great deal of attention in recent years. In this paper, we introduce a tractable approach to optimism via noise augmented Markov Decision Processes (MDPs), which we show can obtain a competitive regret bound: $\tilde{\mathcal{O}}( |\mathcal{S}|H\sqrt{|\mathcal{S}||\mathcal{A}| T } )$ when augmenting using Gaussian noise, where $T$ is the total number of environment steps. This tractability allows us to apply our approach to the deep RL setting, where we rigorously evaluate the key factors for success of optimistic model-based RL algorithms, bridging the gap between theory and practice.

artificial intelligence, machine learning, reinforcement learning, (13 more...)

2006.11911

Country:

North America > United States > California > Los Angeles County > Long Beach (0.14)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
Europe > Switzerland > Zürich > Zürich (0.14)
(12 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.34)

Yuan, Weihang, Muñoz-Avila, Héctor

Hierarchical Reinforcement Learning for Deep Goal Reasoning: An Expressiveness Analysis

arXiv.org Artificial IntelligenceJun-20-2020

Hierarchical DQN (h-DQN) is a two-level architecture of feedforward neural networks where the meta level selects goals and the lower level takes actions to achieve the goals. We show tasks that cannot be solved by h-DQN, exemplifying the limitation of this type of hierarchical framework (HF). We describe the recurrent hierarchical framework (RHF), generalizing architectures that use a recurrent neural network at the meta level. We analyze the expressiveness of HF and RHF using context-sensitive grammars. We show that RHF is more expressive than HF. We perform experiments comparing an implementation of RHF with two HF baselines; the results corroborate our theoretical findings.

controller, machine learning, reinforcement learning, (19 more...)

arXiv.org Artificial Intelligence

2006.11704

Country: North America > United States (0.28)

Genre: Research Report (0.50)

Industry:

Leisure & Entertainment (0.46)
Government > Regional Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.89)

Yang, Tsung-Yen, Rosca, Justinian, Narasimhan, Karthik, Ramadge, Peter J.

Accelerating Safe Reinforcement Learning with Constraint-mismatched Policies

arXiv.org Artificial IntelligenceJun-20-2020

We consider the problem of reinforcement learning when provided with a baseline control policy and a set of constraints that the controlled system must satisfy. The baseline policy might arise from a heuristic, a prior application, a teacher or demonstrator data. The constraints might encode safety, fairness or some application-specific requirements. We want to efficiently use reinforcement learning to adapt the baseline policy to improve performance and satisfy the given constraints when it is applied to the new system. The key challenge is to effectively use the baseline policy (which need not satisfy the current constraints) to aid the learning of a constraint-satisfying policy in the new application. We propose an iterative algorithm for solving this problem. Each iteration is composed of three-steps. The first step performs a policy update to increase the expected reward, the second step performs a projection to minimize the distance between the current policy and the baseline policy, and the last step performs a projection onto the set of policies that satisfy the constraints. This procedure allows the learning process to leverage the baseline policy to achieve faster learning while improving reward performance and satisfying the constraints imposed on the current problem. We analyze the convergence of the proposed algorithm and provide a finite-sample guarantee. Empirical results demonstrate that the algorithm can achieve superior performance, with 10 times fewer constraint violations and around 40% higher reward compared to state-of-the-art methods.

artificial intelligence, machine learning, reinforcement learning, (19 more...)

arXiv.org Artificial Intelligence

2006.11645

Genre: Research Report > Promising Solution (0.34)

Industry:

Leisure & Entertainment (0.93)
Transportation > Ground > Road (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Policy Evaluation and Seeking for Multi-Agent Reinforcement Learning via Best Response

Yan, Rui, Duan, Xiaoming, Shi, Zongying, Zhong, Yisheng, Marden, Jason R., Bullo, Francesco

This paper introduces two metrics (cycle-based and memory-based metrics), grounded on a dynamical game-theoretic solution concept called sink equilibrium, for the evaluation, ranking, and computation of policies in multi-agent learning. We adopt strict best response dynamics (SBRD) to model selfish behaviors at a meta-level for multi-agent reinforcement learning. Our approach can deal with dynamical cyclical behaviors (unlike approaches based on Nash equilibria and Elo ratings), and is more compatible with single-agent reinforcement learning than alpha-rank which relies on weakly better responses. We first consider settings where the difference between largest and second largest underlying metric has a known lower bound. With this knowledge we propose a class of perturbed SBRD with the following property: only policies with maximum metric are observed with nonzero probability for a broad class of stochastic games with finite memory. We then consider settings where the lower bound for the difference is unknown. For this setting, we propose a class of perturbed SBRD such that the metrics of the policies observed with nonzero probability differ from the optimal by any given tolerance. The proposed perturbed SBRD addresses the opponent-induced non-stationarity by fixing the strategies of others for the learning agent, and uses empirical game-theoretic analysis to estimate payoffs for each strategy profile obtained due to the perturbation.

agent, joint strategy, sink equilibrium, (14 more...)

2006.09585

Country:

South America > Brazil > São Paulo (0.04)
North America > United States > Massachusetts > Hampshire County > Amherst (0.04)
North America > United States > California > Santa Barbara County > Santa Barbara (0.04)
(8 more...)

Genre: Research Report (0.40)

Industry: Leisure & Entertainment > Games (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Krishnamurthy, Vikram, Angley, Daniel, Evans, Robin, Moran, William

Identifying Cognitive Radars -- Inverse Reinforcement Learning using Revealed Preferences

arXiv.org Artificial IntelligenceJun-20-2020

We consider an inverse reinforcement learning problem involving us versus an enemy radar equipped with a Bayesian tracker. By observing the emissions of the enemy radar,how can we identify if the radar is cognitive (constrained utility maximizer)? Given the observed sequence of actions taken by the enemy's radar, we consider three problems: (i) Are the enemy radar's actions (waveform choice, beam scheduling) consistent with constrained utility maximization? If so how can we estimate the cognitive radar's utility function that is consistent with its actions. We formulate and solve the problem in terms of the spectra (eigenvalues) of the state and observation noise covariance matrices, and the algebraic Riccati equation. (ii) How to construct a statistical test for detecting a cognitive radar (constrained utility maximization) when we observe the radar's actions in noise or the radar observes our probe signal in noise? We propose a statistical detector with a tight Type-II error bound. (iii) How can we optimally probe (interrogate) the enemy's radar by choosing our state to minimize the Type-II error of detecting if the radar is deploying an economic rational strategy, subject to a constraint on the Type-I detection error? We present a stochastic optimization algorithm to optimize our probe signal. The main analysis framework used in this paper is that of revealed preferences from microeconomics.

machine learning, radar, reinforcement learning, (20 more...)

arXiv.org Artificial Intelligence

doi: 10.1109/TSP.2020.3013516

1912.00331

Country:

North America > United States > New York (0.04)
North America > United States > New Jersey (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (0.64)

Industry: Education (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Cognitive Science > Cognitive Architectures (0.85)
(3 more...)

Russel, Reazul Hasan, Behzadian, Bahram, Petrik, Marek

Entropic Risk Constrained Soft-Robust Policy Optimization

Having a perfect model to compute the optimal policy is often infeasible in reinforcement learning. It is important in high-stakes domains to quantify and manage risk induced by model uncertainties. Entropic risk measure is an exponential utility-based convex risk measure that satisfies many reasonable properties. In this paper, we propose an entropic risk constrained policy gradient and actor-critic algorithms that are risk-averse to the model uncertainty. We demonstrate the usefulness of our algorithms on several problem domains.

artificial intelligence, machine learning, reinforcement learning, (17 more...)

2006.11679

Country: North America > United States > New Hampshire (0.04)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.48)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.35)

Krishnamurthy, Vikram, Yin, George

Langevin Dynamics for Inverse Reinforcement Learning of Stochastic Gradient Algorithms

Inverse reinforcement learning (IRL) aims to estimate the reward function of optimizing agents by observing their response (estimates or actions). This paper considers IRL when noisy estimates of the gradient of a reward function generated by multiple stochastic gradient agents are observed. We present a generalized Langevin dynamics algorithm to estimate the reward function $R(\theta)$; specifically, the resulting Langevin algorithm asymptotically generates samples from the distribution proportional to $\exp(R(\theta))$. The proposed IRL algorithms use kernel-based passive learning schemes. We also construct multi-kernel passive Langevin algorithms for IRL which are suitable for high dimensional data. The performance of the proposed IRL algorithms are illustrated on examples in adaptive Bayesian learning, logistic regression (high dimensional problem) and constrained Markov decision processes. We prove weak convergence of the proposed IRL algorithms using martingale averaging methods. We also analyze the tracking performance of the IRL algorithms in non-stationary environments where the utility function $R(\theta)$ jump changes over time as a slow Markov chain.

artificial intelligence, machine learning, reinforcement learning, (17 more...)

2006.11674

Country:

North America > United States > New York (0.04)
North America > United States > Michigan > Wayne County > Detroit (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
(2 more...)

Genre: Research Report (0.90)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (1.00)

Killian, Taylor W., Ghassemi, Marzyeh, Joshi, Shalmali

Counterfactually Guided Policy Transfer in Clinical Settings

Reliably transferring treatment policies learned in one clinical environment to another is currently limited by challenges related to domain shift. In this paper we address off-policy learning for sequential decision making under domain shift -- a scenario susceptible to catastrophic overconfidence -- which is highly relevant to a high-stakes clinical settings where the target domain may also be data-scarce. We propose a two-fold counterfactual regularization procedure to improve off-policy learning, addressing domain shift and data scarcity. First, we utilize an informative prior derived from a data-rich source environment to indirectly improve drawing counterfactual example observations. Then, these samples are then used to learn a policy for the target domain, regularized by the source policy through KL-divergence. In simulated sepsis treatment, our counterfactual policy transfer procedure significantly improves the performance of a learned treatment policy.

artificial intelligence, machine learning, reinforcement learning, (17 more...)

2006.11654

Country:

North America > Canada > Ontario > Toronto (0.14)
North America > United States (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (1.00)

Industry:

Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (1.00)
Health & Medicine > Therapeutic Area > Endocrinology > Diabetes (0.31)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models (0.93)

#artificialintelligenceJun-19-2020, 03:01:39 GMT

r/MachineLearning - [R] Google AI Blog: Using Selective Attention in Reinforcement Learning Agents

Inattentional blindness is the psychological phenomenon that causes one to miss things in plain sight, and is a consequence of the selective attention that enables you to remain focused on important parts of the world without distraction from irrelevant details. It is believed that this selective attention mechanism enables people to condense broad sensory information into a form that is compact enough to be used for future decision making. While this may seem to be a limitation, such "bottlenecks" observed in nature can also inspire the design of machine learning systems that hope to mimic the success and efficiency of biological organisms. For example, while most methods presented in the deep reinforcement learning (RL) literature allow an agent to access the entire visual input, and even incorporating modules for predicting future sequences of visual inputs, perhaps reducing an agent's access to its visual inputs via an attention constraint could be beneficial to an agent's performance?

artificial intelligence, machine learning, reinforcement learning agent, (4 more...)

#artificialintelligence

Industry: Media > News (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)