 Parr, Ronald


Sample Complexity and Performance Bounds for Non-Parametric Approximate Linear Programming

AAAI Conferences

One of the most difficult tasks in value function approximation for Markov Decision Processes is finding an approximation architecture that is expressive enough to capture the important structure in the value function, while at the same time not overfitting the training samples. Recent results in non-parametric approximate linear programming (NP-ALP) have demonstrated that this can be done effectively using nothing more than a smoothness assumption on the value function. In this paper we extend these results to the case where samples come from real-world transitions instead of the full Bellman equation, adding robustness to noise. In addition, we provide the first max-norm, finite-sample performance guarantees for any form of ALP. NP-ALP is amenable to problems with large (multidimensional) or even infinite (continuous) action spaces, and does not require a model to select actions using the resulting approximate solution.
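
A minimal sketch of the linear program at the heart of this approach, assuming sampled transitions (s_i, a_i, r_i, s'_i), a state-relevance weighting \rho, a metric d, and a Lipschitz constant L_v; the constraint set and constants are illustrative, not a restatement of the paper's formulation:

    \min_{v \in \mathbb{R}^n} \; \sum_i \rho(s_i)\, v(s_i)
    \text{s.t.} \quad v(s_i) \ge r_i + \gamma\, v(s'_i) \quad \text{(sampled Bellman constraints)}
    \qquad\;\;\, |v(s_i) - v(s_j)| \le L_v\, d(s_i, s_j) \quad \text{(smoothness constraints)}

The decision variables are the values at the sampled states themselves rather than weights on features chosen a priori, which is what makes the method non-parametric; the Lipschitz constraints are what extend the solution off the samples.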


Computing Optimal Strategies to Commit to in Stochastic Games

AAAI Conferences

Significant progress has been made recently in the following two lines of research at the intersection of AI and game theory: (1) the computation of optimal strategies to commit to (Stackelberg strategies), and (2) the computation of correlated equilibria of stochastic games. In this paper, we unite these two lines of research by studying the computation of Stackelberg strategies in stochastic games. We provide theoretical results on the value of being able to commit and the value of being able to correlate, as well as complexity results about computing Stackelberg strategies in stochastic games. We then modify the QPACE algorithm (MacDermed et al. 2011) to compute Stackelberg strategies, and provide experimental results.
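
As background for line (1), and not the stochastic-game algorithm of this paper: in a one-shot bimatrix game, an optimal mixed strategy to commit to can be computed by solving one LP per candidate follower response and keeping the best feasible solution (the multiple-LPs method of Conitzer and Sandholm 2006). A sketch in Python, where the payoff matrices U_L and U_F are hypothetical inputs:

    import numpy as np
    from scipy.optimize import linprog

    def stackelberg_mixed_strategy(U_L, U_F):
        """One LP per follower action j: maximize the leader's expected
        payoff subject to j being a best response to the leader's mix x."""
        n, m = U_L.shape
        best_val, best_x = -np.inf, None
        for j in range(m):
            c = -U_L[:, j]                                   # linprog minimizes
            # follower must weakly prefer j to every alternative k
            rows = [U_F[:, k] - U_F[:, j] for k in range(m) if k != j]
            res = linprog(c,
                          A_ub=np.array(rows) if rows else None,
                          b_ub=np.zeros(len(rows)) if rows else None,
                          A_eq=np.ones((1, n)), b_eq=[1.0])  # x is a distribution
            if res.success and -res.fun > best_val:
                best_val, best_x = -res.fun, res.x
        return best_x, best_val

In a stochastic game the commitment is to behavior over time and correlation becomes available, which is where the modified QPACE machinery enters.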


Greedy Algorithms for Sparse Reinforcement Learning

arXiv.org Machine Learning

Feature selection and regularization are becoming increasingly prominent tools in the efforts of the reinforcement learning (RL) community to expand the reach and applicability of RL. One approach to the problem of feature selection is to impose a sparsity-inducing form of regularization on the learning method. Recent work on $L_1$ regularization has adapted techniques from the supervised learning literature for use with RL. Another approach that has received renewed attention in the supervised learning community is that of using a simple algorithm that greedily adds new features. Such algorithms have many of the good properties of the $L_1$ regularization methods, while also being extremely efficient and, in some cases, allowing theoretical guarantees on recovery of the true form of a sparse target function from sampled data. This paper considers variants of orthogonal matching pursuit (OMP) applied to reinforcement learning. The resulting algorithms are analyzed and compared experimentally with existing $L_1$ regularized approaches. We demonstrate that sparse recovery fails in perhaps the most natural scenario in which one might hope to achieve it; however, one variant, OMP-BRM, provides promising theoretical guarantees under certain assumptions on the feature dictionary. Another variant, OMP-TD, empirically outperforms prior methods both in approximation accuracy and efficiency on several benchmark problems.
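
A minimal sketch of the greedy loop in the spirit of OMP-TD, assuming feature matrices Phi and Phi_next (features at sampled states and their successors), sampled rewards r, and a fixed budget of k features; the actual algorithm uses a correlation threshold rather than a fixed budget, so this is illustrative only:

    import numpy as np

    def omp_td_sketch(Phi, Phi_next, r, gamma, k):
        """Greedily add the feature most correlated with the current Bellman
        residual, then recompute the least-squares TD fixed point on the
        selected feature set."""
        n, p = Phi.shape
        selected, w = [], np.zeros(p)
        for _ in range(k):
            residual = r + gamma * (Phi_next @ w) - Phi @ w
            corr = np.abs(Phi.T @ residual) / n
            corr[selected] = -np.inf                 # skip features already chosen
            selected.append(int(np.argmax(corr)))
            S = selected
            A = Phi[:, S].T @ (Phi[:, S] - gamma * Phi_next[:, S])
            b = Phi[:, S].T @ r
            w = np.zeros(p)
            w[S] = np.linalg.lstsq(A, b, rcond=None)[0]  # TD fixed point on S
        return w, selected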


Non-Parametric Approximate Linear Programming for MDPs

AAAI Conferences

The Approximate Linear Programming (ALP) approach to value function approximation for MDPs is a parametric value function approximation method, in that it represents the value function as a linear combination of features which are chosen a priori. Choosing these features can be a difficult challenge in itself. One recent effort, Regularized Approximate Linear Programming (RALP), uses L1 regularization to address this issue by combining a large initial set of features with a regularization penalty that favors a smooth value function with few non-zero weights. Rather than using smoothness as a backhanded way of addressing the feature selection problem, this paper starts with smoothness and develops a non-parametric approach to ALP that is consistent with the smoothness assumption. We show that this new approach has some favorable practical and analytical properties in comparison to (R)ALP.
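
For concreteness, one common way to write the RALP program that this paper takes as its point of departure, with \Phi a large feature matrix, \rho a state-relevance weighting, T the sampled Bellman operator, and \psi a regularization level (shown in constraint form; a penalty in the objective is equivalent for a suitable \psi):

    \min_{w} \; \rho^\top \Phi w \quad \text{s.t.} \quad \Phi w \ge T\Phi w \ \text{(on samples)}, \qquad \|w\|_1 \le \psi

The non-parametric approach developed here drops the features entirely and optimizes over the sampled values directly, subject to the smoothness (Lipschitz) constraints that the L1 penalty was only implicitly encouraging.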


Linear Complementarity for Regularized Policy Evaluation and Improvement

Neural Information Processing Systems

Recent work in reinforcement learning has emphasized the power of L1 regularization to perform feature selection and prevent overfitting. We propose formulating the L1 regularized linear fixed point problem as a linear complementarity problem (LCP). This formulation offers several advantages over the LARS-inspired formulation, LARS-TD. The LCP formulation allows the use of efficient off-the-shelf solvers, leads to a new uniqueness result, and can be initialized with starting points from similar problems (warm starts). We demonstrate that warm starts, as well as the efficiency of LCP solvers, can speed up policy iteration. Moreover, warm starts permit a form of modified policy iteration that can be used to approximate a "greedy" homotopy path, a generalization of the LARS-TD homotopy path that combines policy evaluation and optimization.
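
A sketch of the reduction, assuming the usual LARS-TD quantities A = \Phi^\top(\Phi - \gamma\Phi') and b = \Phi^\top R and regularization parameter \beta (the paper's matrices may carry a sample-distribution weighting). Splitting w = w^+ - w^- with w^+, w^- \ge 0 and stacking z = (w^+, w^-), the fixed-point condition b - Aw \in \beta\,\partial\|w\|_1 becomes: find z \ge 0 with

    Mz + q \ge 0, \qquad z^\top (Mz + q) = 0, \qquad
    M = \begin{pmatrix} A & -A \\ -A & A \end{pmatrix}, \quad
    q = \begin{pmatrix} \beta\mathbf{1} - b \\ \beta\mathbf{1} + b \end{pmatrix}.

Componentwise this enforces |b - Aw|_i \le \beta everywhere, with equality at the matching sign wherever w_i \ne 0, which is exactly the subgradient condition; any off-the-shelf LCP solver then applies, and a solution from a nearby problem supplies the warm start.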


Complexity of Computing Optimal Stackelberg Strategies in Security Resource Allocation Games

AAAI Conferences

Recently, algorithms for computing game-theoretic solutions have been deployed in real-world security applications, such as the placement of checkpoints and canine units at Los Angeles International Airport. These algorithms assume that the defender (security personnel) can commit to a mixed strategy, a so-called Stackelberg model. As pointed out by Kiekintveld et al. (2009), in these applications, generally, multiple resources need to be assigned to multiple targets, resulting in an exponential number of pure strategies for the defender. In this paper, we study how to compute optimal Stackelberg strategies in such games, showing that this can be done in polynomial time in some cases, and is NP-hard in others.


Hierarchical Linear/Constant Time SLAM Using Particle Filters for Dense Maps

Neural Information Processing Systems

We present an improvement to the DP-SLAM algorithm for simultaneous localization and mapping (SLAM) that maintains multiple hypotheses about densely populated maps (one full map per particle in a particle filter) in time that is linear in all significant algorithm parameters and takes constant (amortized) time per iteration. This means that the asymptotic complexity of the algorithm is no greater than that of a pure localization algorithm using a single map and the same number of particles. We also present a hierarchical extension of DP-SLAM that uses a two-level particle filter which models drift in the particle filtering process itself. The hierarchical approach enables recovery from the inevitable drift that results from using a finite number of particles in a particle filter and permits the use of DP-SLAM in more challenging domains, while maintaining linear-time asymptotic complexity.
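
A deliberately simplified sketch of the shared-map idea behind DP-SLAM: all particles share one occupancy grid, each cell records updates keyed by the id of the particle that made them, and a read walks up the particle ancestry tree to the nearest ancestor that observed the cell. The actual algorithm maintains balanced trees per cell and prunes the ancestry tree to obtain its amortized bounds; the class and method names here are hypothetical:

    class Particle:
        """Node in the particle ancestry tree."""
        _next_id = 0
        def __init__(self, parent=None):
            self.id = Particle._next_id
            Particle._next_id += 1
            self.parent = parent

    class SharedMap:
        """One grid shared by all particles; a cell stores only the updates
        made along particle lineages, keyed by particle id."""
        def __init__(self, width, height):
            self.cells = [[{} for _ in range(width)] for _ in range(height)]
        def write(self, particle, x, y, occupancy):
            self.cells[y][x][particle.id] = occupancy
        def read(self, particle, x, y):
            cell, p = self.cells[y][x], particle
            while p is not None:          # nearest ancestor that wrote this cell
                if p.id in cell:
                    return cell[p.id]
                p = p.parent
            return None                   # unobserved along this lineage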


Learning in Zero-Sum Team Markov Games Using Factored Value Functions

Neural Information Processing Systems

We present a new method for learning good strategies in zero-sum Markov games in which each side is composed of multiple agents collaborating against an opposing team of agents. Our method requires full observability and communication during learning, but the learned policies can be executed in a distributed manner. The value function is represented as a factored linear architecture and its structure determines the necessary computational resources and communication bandwidth. This approach permits a tradeoff between simple representations with little or no communication between agents and complex, computationally intensive representations with extensive coordination between agents. Thus, we provide a principled means of using approximation to combat the exponential blowup in the joint action space of the participants. The approach is demonstrated with an example that shows the efficiency gains over naive enumeration.
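
A small sketch of the coordination step such a factored representation enables, assuming the joint payoff decomposes as a sum of factors each touching only a few agents; agents are then eliminated one at a time (variable elimination), so the cost depends on factor scopes rather than the full joint action space. The factor scopes, names, and payoff tables below are illustrative:

    def max_out(factors, agent, domain):
        """Eliminate one agent: all factors mentioning it are replaced by a
        single new factor over the remaining variables in their scopes."""
        touching = [f for f in factors if agent in f[0]]
        rest = [f for f in factors if agent not in f[0]]
        scope = tuple(sorted({v for s, _ in touching for v in s} - {agent}))
        cache = {}
        def fn(assign, touching=touching, agent=agent, domain=domain, scope=scope):
            key = tuple(assign[v] for v in scope)
            if key not in cache:
                cache[key] = max(sum(g({**assign, agent: a}) for _, g in touching)
                                 for a in domain)
            return cache[key]
        return rest + [(scope, fn)]

    def coordinate(factors, order, domains):
        """Value of the best joint action, eliminating agents in order."""
        for agent in order:
            factors = max_out(factors, agent, domains[agent])
        return sum(g({}) for _, g in factors)   # all scopes are empty now

    # Two agents who are paid for matching, plus a private bonus for agent a1:
    factors = [(("a1", "a2"), lambda t: 1.0 if t["a1"] == t["a2"] else 0.0),
               (("a1",), lambda t: 0.5 if t["a1"] == 1 else 0.0)]
    print(coordinate(factors, ["a2", "a1"], {"a1": [0, 1], "a2": [0, 1]}))  # 1.5

Recovering the maximizing joint action itself takes a standard backward pass over the same eliminations.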


Multiagent Planning with Factored MDPs

Neural Information Processing Systems

We present a principled and efficient planning algorithm for cooperative multiagent dynamic systems. A striking feature of our method is that the coordination and communication between the agents is not imposed, but derived directly from the system dynamics and function approximation architecture. We view the entire multiagent system as a single, large Markov decision process (MDP), which we assume can be represented in a factored way using a dynamic Bayesian network (DBN). The action space of the resulting MDP is the joint action space of the entire set of agents. Our approach is based on the use of factored linear value functions as an approximation to the joint value function.
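
The structural assumption can be sketched in one line, with notation that is illustrative rather than the paper's: the joint value function is approximated as

    V(\mathbf{x}) \approx \sum_j w_j\, h_j(\mathbf{x}[C_j])

where each basis function h_j depends only on a small subset C_j of the state variables. Because the transition model is a DBN, the one-step backprojection of each h_j also has small scope, so both planning and the induced coordination among agents operate on a sparse graph of local interactions rather than on the exponential joint action space.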