AITopics

Tiapkin, Daniil, Stonyakin, Fedor, Gasnikov, Alexander

Parallel Stochastic Mirror Descent for MDPs

arXiv.org Machine LearningFeb-27-2021

We consider the problem of learning the optimal policy for infinite-horizon Markov decision processes (MDPs). For this purpose, some variant of Stochastic Mirror Descent is proposed for convex programming problems with Lipschitz-continuous functionals. An important detail is the ability to use inexact values of functional constraints. We analyze this algorithm in a general case and obtain an estimate of the convergence rate that does not accumulate errors during the operation of the method. Using this algorithm, we get the first parallel algorithm for average-reward MDPs with a generative model. One of the main features of the presented method is low communication costs in a distributed centralized setting.

algorithm, parallel stochastic mirror descent, stochastic mirror descent, (14 more...)

2103.00299

Country:

Asia > Russia (0.05)
Europe > Russia > Central Federal District > Moscow Oblast > Moscow (0.04)
North America > United States (0.04)
(3 more...)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.35)

Kuinchtner, Daniela, Sales, Afonso, Meneguzzi, Felipe

CP-MDP: A CANDECOMP-PARAFAC Decomposition Approach to Solve a Markov Decision Process Multidimensional Problem

arXiv.org Artificial IntelligenceFeb-27-2021

Markov Decision Process (MDP) is the underlying model for optimal planning for decision-theoretic agents in stochastic environments. Although much research focuses on solving MDP problems both in tabular form or using factored representations, none focused on tensor decomposition methods. Solving MDPs using tensor algebra offers the prospect of leveraging advances in tensor-based computations to further increase solver efficiency. In this paper, we develop an MDP solver for a multidimensional problem using a tensor decomposition method to compress the transition models and optimize the value iteration and policy iteration algorithms. We empirically evaluate our approach against tabular methods and show our approach can compute much larger problems using substantially less memory, opening up new possibilities for tensor-based approaches in stochastic planning

decomposition, dimension, iteration, (12 more...)

2103.00331

Country:

North America > United States > New Jersey > Mercer County > Princeton (0.14)
Africa > Senegal > Kolda Region > Kolda (0.05)
South America > Brazil > Rio Grande do Sul (0.04)
(3 more...)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.62)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.47)

Nüsken, Nikolas, Renger, D. R. Michiel

Stein Variational Gradient Descent: many-particle and long-time asymptotics

arXiv.org Machine LearningFeb-25-2021

Stein variational gradient descent (SVGD) refers to a class of methods for Bayesian inference based on interacting particle systems. In this paper, we consider the originally proposed deterministic dynamics as well as a stochastic variant, each of which represent one of the two main paradigms in Bayesian computational statistics: variational inference and Markov chain Monte Carlo. As it turns out, these are tightly linked through a correspondence between gradient flow structures and large-deviation principles rooted in statistical physics. To expose this relationship, we develop the cotangent space construction for the Stein geometry, prove its basic properties, and determine the large-deviation functional governing the many-particle limit for the empirical measure. Moreover, we identify the Stein-Fisher information (or kernelised Stein discrepancy) as its leading order contribution in the long-time and many-particle regime in the sense of $\Gamma$-convergence, shedding some light on the finite-particle properties of SVGD. Finally, we establish a comparison principle between the Stein-Fisher information and RKHS-norms that might be of independent interest.

stein-fisher information, survey article, upstream oil & gas, (16 more...)

2102.12956

Country:

Europe > Germany (0.46)
North America > United States > New York (0.14)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)

Genre: Research Report > New Finding (0.46)

Industry: Energy > Oil & Gas > Upstream (0.49)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.71)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.35)

arXiv.org Machine LearningFeb-25-2021

Provably Breaking the Quadratic Error Compounding Barrier in Imitation Learning, Optimally

Rajaraman, Nived, Han, Yanjun, Yang, Lin F., Ramchandran, Kannan, Jiao, Jiantao

We study the statistical limits of Imitation Learning (IL) in episodic Markov Decision Processes (MDPs) with a state space $\mathcal{S}$. We focus on the known-transition setting where the learner is provided a dataset of $N$ length-$H$ trajectories from a deterministic expert policy and knows the MDP transition. We establish an upper bound $O(|\mathcal{S}|H^{3/2}/N)$ for the suboptimality using the Mimic-MD algorithm in Rajaraman et al (2020) which we prove to be computationally efficient. In contrast, we show the minimax suboptimality grows as $\Omega( H^{3/2}/N)$ when $|\mathcal{S}|\geq 3$ while the unknown-transition setting suffers from a larger sharp rate $\Theta(|\mathcal{S}|H^2/N)$ (Rajaraman et al (2020)). The lower bound is established by proving a two-way reduction between IL and the value estimation problem of the unknown expert policy under any given reward function, as well as building connections with linear functional estimation with subsampled observations. We further show that under the additional assumption that the expert is optimal for the true reward function, there exists an efficient algorithm, which we term as Mimic-Mixture, that provably achieves suboptimality $O(1/N)$ for arbitrary 3-state MDPs with rewards only at the terminal layer. In contrast, no algorithm can achieve suboptimality $O(\sqrt{H}/N)$ with high probability if the expert is not constrained to be optimal. Our work formally establishes the benefit of the expert optimal assumption in the known transition setting, while Rajaraman et al (2020) showed it does not help when transitions are unknown.

algorithm, probability, trajectory, (15 more...)

2102.12948

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.14)
North America > United States > California > Alameda County > Berkeley (0.04)
North America > United States > New York > New York County > New York City (0.04)
(4 more...)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.48)

Topin, Nicholay, Milani, Stephanie, Fang, Fei, Veloso, Manuela

Iterative Bounding MDPs: Learning Interpretable Policies via Non-Interpretable Methods

arXiv.org Artificial IntelligenceFeb-25-2021

Current work in explainable reinforcement learning generally produces policies in the form of a decision tree over the state space. Such policies can be used for formal safety verification, agent behavior prediction, and manual inspection of important features. However, existing approaches fit a decision tree after training or use a custom learning procedure which is not compatible with new learning techniques, such as those which use neural networks. To address this limitation, we propose a novel Markov Decision Process (MDP) type for learning decision tree policies: Iterative Bounding MDPs (IBMDPs). An IBMDP is constructed around a base MDP so each IBMDP policy is guaranteed to correspond to a decision tree policy for the base MDP when using a method-agnostic masking procedure. Because of this decision tree equivalence, any function approximator can be used during training, including a neural network, while yielding a decision tree policy for the base MDP. We present the required masking procedure as well as a modified value update step which allows IBMDPs to be solved using existing algorithms. We apply this procedure to produce IBMDP variants of recent reinforcement learning methods. We empirically show the benefits of our approach by solving IBMDPs to produce decision tree policies for the base MDPs.

agent, base mdp, mdp, (15 more...)

2102.13045

Country:

North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.14)
North America > United States > California > Alameda County > Berkeley (0.04)

Genre: Research Report (0.82)

Industry:

Government > Military (0.68)
Government > Regional Government > North America Government > United States Government (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.66)

Journal of Artificial Intelligence ResearchFeb-24-2021

A Sufficient Statistic for Influence in Structured Multiagent Environments

Oliehoek, Frans (Delft University of Technology) | Witwicki, Stefan (Nissan) | Kaelbling, Leslie (MIT)

Making decisions in complex environments is a key challenge in artificial intelligence (AI). Situations involving multiple decision makers are particularly complex, leading to computational intractability of principled solution methods. A body of work in AI has tried to mitigate this problem by trying to distill interaction to its essence: how does the policy of one agent influence another agent? If we can find more compact representations of such influence, this can help us deal with the complexity, for instance by searching the space of influences rather than the space of policies. However, so far these notions of influence have been restricted in their applicability to special cases of interaction. In this paper we formalize influence-based abstraction (IBA), which facilitates the elimination of latent state factors without any loss in value, for a very general class of problems described as factored partially observable stochastic games (fPOSGs). On the one hand, this generalizes existing descriptions of influence, and thus can serve as the foundation for improvements in scalability and other insights in decision making in complex multiagent settings. On the other hand, since the presence of other agents can be seen as a generalization of single agent settings, our formulation of IBA also provides a sufficient statistic for decision making under abstraction for a single agent. We also give a detailed discussion of the relations to such previous works, identifying new insights and interpretations of these approaches. In these ways, this paper deepens our understanding of abstraction in a wide range of sequential decision making settings, providing the basis for new approaches and algorithms for a large class of problems.

abstraction, agent, oliehoek, (13 more...)

Journal of Artificial Intelligence Research

doi: 10.1613/jair.1.12136

AI Access Foundation

12136

Journal of Artificial Intelligence Research

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
North America > United States > California > San Francisco County > San Francisco (0.14)
North America > United States > California > Los Angeles County > Los Angeles (0.14)
(6 more...)

Genre:

Research Report (0.45)
Overview (0.45)

Industry:

Leisure & Entertainment > Games (0.67)
Government > Regional Government (0.45)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
(6 more...)

Lazic, Nevena, Yin, Dong, Abbasi-Yadkori, Yasin, Szepesvari, Csaba

Improved Regret Bound and Experience Replay in Regularized Policy Iteration

arXiv.org Machine LearningFeb-24-2021

In this work, we study algorithms for learning in infinite-horizon undiscounted Markov decision processes (MDPs) with function approximation. We first show that the regret analysis of the Politex algorithm (a version of regularized policy iteration) can be sharpened from $O(T^{3/4})$ to $O(\sqrt{T})$ under nearly identical assumptions, and instantiate the bound with linear function approximation. Our result provides the first high-probability $O(\sqrt{T})$ regret bound for a computationally efficient algorithm in this setting. The exact implementation of Politex with neural network function approximation is inefficient in terms of memory and computation. Since our analysis suggests that we need to approximate the average of the action-value functions of past policies well, we propose a simple efficient implementation where we train a single Q-function on a replay buffer with past data. We show that this often leads to superior performance over other implementation choices, especially in terms of wall-clock time. Our work also provides a novel theoretical justification for using experience replay within policy iteration algorithms.

algorithm, experience replay, implementation, (12 more...)

2102.12611

Country:

North America > Canada > Alberta (0.14)
Asia > Middle East > Jordan (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.94)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.75)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.34)

arXiv.org Artificial IntelligenceFeb-24-2021

Credit Assignment with Meta-Policy Gradient for Multi-Agent Reinforcement Learning

Shao, Jianzhun, Zhang, Hongchang, Jiang, Yuhang, He, Shuncheng, Ji, Xiangyang

Reward decomposition is a critical problem in centralized training with decentralized execution (CTDE) paradigm for multi-agent reinforcement learning. To take full advantage of global information, which exploits the states from all agents and the related environment for decomposing Q values into individual credits, we propose a general meta-learning-based Mixing Network with Meta Policy Gradient (MNMPG) framework to distill the global hierarchy for delicate reward decomposition. The excitation signal for learning global hierarchy is deduced from the episode reward difference between before and after "exercise updates" through the utility network. Our method is generally applicable to the CTDE method using a monotonic mixing network. Experiments on the StarCraft II micromanagement benchmark demonstrate that our method just with a simple utility network is able to outperform the current state-of-the-art MARL algorithms on 4 of 5 super hard scenarios. Better performance can be further achieved when combined with a role-based utility network.

hierarchy, reinforcement learning, utility network, (14 more...)

2102.12957

Genre: Research Report (0.82)

Industry: Leisure & Entertainment > Games > Computer Games (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents > Agent Societies (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)

arXiv.org Artificial IntelligenceFeb-24-2021

The Logical Options Framework

Araki, Brandon, Li, Xiao, Vodrahalli, Kiran, DeCastro, Jonathan, Fry, Micah J., Rus, Daniela

Learning composable policies for environments with complex rules and tasks is a challenging problem. We introduce a hierarchical reinforcement learning framework called the Logical Options Framework (LOF) that learns policies that are satisfying, optimal, and composable. LOF efficiently learns policies that satisfy tasks by representing the task as an automaton and integrating it into learning and planning. We provide and prove conditions under which LOF will learn satisfying, optimal policies. And lastly, we show how LOF's learned policies can be composed to satisfy unseen tasks with only 10-50 retraining steps. We evaluate LOF on four tasks in discrete and continuous domains, including a 3D pick-and-place environment.

optimality, proposition, subgoal, (14 more...)

2102.12571

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
South America > Colombia (0.04)
North America > United States > Massachusetts > Middlesex County > Lexington (0.04)
Europe > Germany (0.04)

Genre: Research Report (0.82)

Industry: Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Planning & Scheduling (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)