
Collaborating Authors

Bowling, Michael


Approximate Linear Programming for Constrained Partially Observable Markov Decision Processes

AAAI Conferences

In many situations, it is desirable to optimize a sequence of decisions by maximizing a primary objective while respecting constraints on secondary objectives. Such problems can be naturally modeled as constrained partially observable Markov decision processes (CPOMDPs) when the environment is partially observable. In this work, we describe a technique based on approximate linear programming to optimize policies in CPOMDPs. The optimization is performed offline and produces a finite-state controller with desirable performance guarantees. The approach outperforms a constrained version of point-based value iteration on a suite of benchmark problems.
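
As a rough illustration of the linear-programming core, the sketch below solves the occupancy-measure LP for a small, fully observable constrained MDP: maximize expected discounted reward subject to a bound on expected discounted cost. This is a simplification, not the paper's method, which handles partial observability and produces a finite-state controller; all sizes, dynamics, and the cost bound are made-up toy values, and NumPy/SciPy are assumed.

```python
# Sketch: occupancy-measure LP for a toy *fully observable* constrained MDP.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
S, A = 3, 2                      # tiny toy state/action spaces (assumed)
gamma = 0.95                     # discount factor
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a, s'] transition probs
R = rng.random((S, A))           # primary reward (toy)
C = rng.random((S, A))           # secondary cost (toy)
# Cost bound chosen so the problem is feasible by construction:
# picking the cheapest action in every state always satisfies it.
c_max = C.min(axis=1).max() / (1 - gamma)
mu0 = np.full(S, 1.0 / S)        # initial state distribution

# Variables: occupancy measure mu(s, a) >= 0, flattened to length S*A.
# Flow constraint for each s':
#   sum_a mu(s', a) - gamma * sum_{s,a} P[s, a, s'] * mu(s, a) = mu0(s')
A_eq = np.zeros((S, S * A))
for sp in range(S):
    for a in range(A):
        A_eq[sp, sp * A + a] += 1.0
    for s in range(S):
        for a in range(A):
            A_eq[sp, s * A + a] -= gamma * P[s, a, sp]

# Secondary-objective constraint: sum_{s,a} C(s,a) * mu(s,a) <= c_max
res = linprog(-R.reshape(-1),                  # linprog minimizes
              A_ub=C.reshape(1, -1), b_ub=[c_max],
              A_eq=A_eq, b_eq=mu0,
              bounds=[(0, None)] * (S * A))
mu = res.x.reshape(S, A)
policy = mu / mu.sum(axis=1, keepdims=True)    # pi(a|s) from occupancy
print("constrained-optimal policy:\n", policy)
```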


Improving Exploration in UCT Using Local Manifolds

AAAI Conferences

Monte-Carlo planning has been proven successful in many sequential decision-making settings, but it suffers from poor exploration when the rewards are sparse. In this paper, we improve exploration in UCT by generalizing across similar states using a given distance metric. We show that this algorithm, like UCT, converges asymptotically to the optimal action. When the state space does not have a natural distance metric, we show how we can learn a local manifold from the transition graph of states in the near future to obtain a distance metric. On domains inspired by video games, empirical evidence shows that our algorithm is more sample efficient than UCT, particularly when rewards are sparse.
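
The sketch below illustrates only the generalization idea, in hedged form: UCB1 statistics shared across states through a distance kernel, assuming a distance function `d` is already given. The paper's contribution of learning that metric from a local manifold of the transition graph is not shown; `kernel`, `update`, and `ucb_action` are illustrative names, not the paper's API.

```python
# Sketch: UCB1 with kernel-weighted generalization across similar states.
import math
from collections import defaultdict

counts = defaultdict(float)   # (state, action) -> kernel-weighted count
values = defaultdict(float)   # (state, action) -> kernel-weighted return sum
seen_states = set()

def kernel(s, s2, d, bandwidth=1.0):
    """Similarity weight derived from the (given or learned) metric d."""
    return math.exp(-d(s, s2) ** 2 / bandwidth)

def update(s, a, ret, d):
    """After a rollout from s returns `ret`, credit all similar seen states."""
    seen_states.add(s)
    for s2 in seen_states:
        w = kernel(s, s2, d)
        counts[(s2, a)] += w
        values[(s2, a)] += w * ret

def ucb_action(s, actions, d, c=1.4):
    total = sum(counts[(s, a)] for a in actions)
    def score(a):
        n = counts[(s, a)]
        if n == 0:
            return float("inf")            # try untried actions first
        return values[(s, a)] / n + c * math.sqrt(math.log(total + 1.0) / n)
    return max(actions, key=score)

# Toy usage: 1-D states with absolute-difference distance (assumed).
d = lambda s, s2: abs(s - s2)
update(0.0, "left", 1.0, d)
update(0.2, "left", 0.0, d)
print(ucb_action(0.1, ["left", "right"], d))
```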


Solving Games with Functional Regret Estimation

AAAI Conferences

We propose a novel online learning method for minimizing regret in large extensive-form games. The approach learns a function approximator online to estimate the regret for choosing a particular action. A no-regret algorithm uses these estimates in place of the true regrets to define a sequence of policies. We prove the approach sound by providing a bound relating the quality of the function approximation to the regret of the algorithm. A corollary is that the method is guaranteed to converge to a Nash equilibrium in self-play so long as the regrets are ultimately realizable by the function approximator. Our technique can be understood as a principled generalization of existing work on abstraction in large games; in our work, both the abstraction and the equilibrium are learned during self-play. We demonstrate empirically that the method achieves higher-quality strategies than state-of-the-art abstraction techniques given the same resources.
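
A minimal sketch of the core loop, under assumptions: regret matching where the cumulative regrets come from a linear function approximator trained online, rather than from a table. The feature representation and regret targets below are placeholders conveying the flavor of the method, not its specification.

```python
# Sketch: regret matching driven by a learned regret estimator.
import numpy as np

class FunctionalRegretMatcher:
    def __init__(self, n_features, n_actions, lr=0.01):
        self.w = np.zeros((n_actions, n_features))  # one linear regressor per action
        self.lr = lr

    def estimated_regrets(self, phi):
        return self.w @ phi

    def policy(self, phi):
        """Regret matching on the *estimated* regrets."""
        r_plus = np.maximum(self.estimated_regrets(phi), 0.0)
        if r_plus.sum() <= 0.0:
            return np.full(len(r_plus), 1.0 / len(r_plus))
        return r_plus / r_plus.sum()

    def observe(self, phi, target_regrets):
        """One SGD step toward the observed cumulative regrets."""
        err = self.estimated_regrets(phi) - target_regrets
        self.w -= self.lr * np.outer(err, phi)

# Toy usage with made-up features and regret targets.
m = FunctionalRegretMatcher(n_features=4, n_actions=3)
phi = np.array([1.0, 0.0, 0.5, -0.5])
m.observe(phi, np.array([0.2, -0.1, 0.4]))
print(m.policy(phi))
```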


Optimal Estimation of Multivariate ARMA Models

AAAI Conferences

A central problem in applied data analysis is time series modeling--estimating and forecasting a discrete-time stochastic process--for which the autoregressive moving average (ARMA) and stochastic ARMA (Thiesson et al. 2012) are fundamental models. An ARMA model describes the behavior of a linear dynamical system under latent Gaussian perturbations (Brockwell and Davis 2002; Lütkepohl 2007), which affords intuitive modeling capability, efficient forecasting algorithms, and a close relationship to linear Gaussian state-space models (Katayama 2006, pp. 5-6). In this paper, we develop a tractable approach to maximum likelihood parameter estimation for stochastic multivariate ARMA models. To efficiently compute a globally optimal estimate, the problem is re-expressed as a regularized loss minimization, which then allows recent algorithmic advances in sparse estimation to be applied (Shah et al. 2012; Candes et al. 2011; Bach, Mairal, and Ponce 2008; Zhang et al. 2011; White et al. 2012). Although there has been recent progress in global estimation for ARMA, such approaches have either been restricted to single-input single-output ...
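
To make the "estimation as regularized loss minimization" reduction concrete, here is a hedged sketch on the easier, purely autoregressive case: fitting a vector autoregression (VAR) by ridge-regularized least squares with NumPy. The paper's actual contribution concerns the harder stochastic ARMA case with moving-average terms; nothing below is its algorithm.

```python
# Sketch: VAR estimation re-expressed as ridge-regularized regression.
import numpy as np

def fit_var_ridge(X, p=2, lam=0.1):
    """X: (T, d) series. Returns AR coefficient matrices A[0..p-1]."""
    T, d = X.shape
    # Stack lagged observations into a regression design matrix.
    Z = np.hstack([X[p - k - 1:T - k - 1] for k in range(p)])  # (T-p, p*d)
    Y = X[p:]                                                  # (T-p, d)
    # Ridge-regularized least squares: (Z'Z + lam I)^{-1} Z'Y
    B = np.linalg.solve(Z.T @ Z + lam * np.eye(p * d), Z.T @ Y)
    return B.reshape(p, d, d).transpose(0, 2, 1)  # A[k] maps x_{t-k-1} -> x_t

# Usage on a toy random-walk series.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3)).cumsum(axis=0)
A = fit_var_ridge(X, p=2, lam=1.0)
print([a.shape for a in A])
```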


Policy Tree: Adaptive Representation for Policy Gradient

AAAI Conferences

Much of the focus on finding good representations in reinforcement learning has been on learning complex non-linear predictors of value. Policy gradient algorithms, which directly represent the policy, often need fewer parameters to learn good policies. However, they typically employ a fixed parametric representation that may not be sufficient for complex domains. This paper introduces the Policy Tree algorithm, which can learn an adaptive representation of policy in the form of a decision tree over different instantiations of a base policy. Policy gradient is used both to optimize the parameters and to grow the tree by choosing splits that enable the maximum local increase in the expected return of the policy. Experiments show that this algorithm can choose genuinely helpful splits and significantly improve upon the commonly used linear Gibbs softmax policy, which we choose as our base policy.
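
The sketch below shows only the base policy the paper builds on: a linear Gibbs softmax policy with its standard REINFORCE-style gradient update. The Policy Tree algorithm would maintain one such parameter block per tree leaf and add the split that most increases expected return; that splitting machinery is omitted here, and all sizes are arbitrary.

```python
# Sketch: linear Gibbs softmax policy with a policy-gradient update.
import numpy as np

class GibbsSoftmaxPolicy:
    def __init__(self, n_features, n_actions, lr=0.05):
        self.theta = np.zeros((n_actions, n_features))
        self.lr = lr

    def probs(self, phi):
        z = self.theta @ phi
        e = np.exp(z - z.max())          # numerically stable softmax
        return e / e.sum()

    def sample(self, phi, rng):
        return rng.choice(len(self.theta), p=self.probs(phi))

    def update(self, phi, action, advantage):
        """grad log pi(a|s) row b = phi * (1[b == a] - pi(b|s))."""
        p = self.probs(phi)
        grad = -np.outer(p, phi)
        grad[action] += phi
        self.theta += self.lr * advantage * grad

# Toy usage with made-up features and a unit advantage.
policy = GibbsSoftmaxPolicy(n_features=4, n_actions=3)
rng = np.random.default_rng(0)
phi = rng.standard_normal(4)
a = policy.sample(phi, rng)
policy.update(phi, a, advantage=1.0)
```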


Search in Imperfect Information Games Using Online Monte Carlo Counterfactual Regret Minimization

AAAI Conferences

Online search in games has always been a core interest of artificial intelligence. Advances made in search for perfect information games (such as Chess, Checkers, Go, and Backgammon) have led to AI capable of defeating the world's top human experts. Search in imperfect information games (such as Poker, Bridge, and Skat) is significantly more challenging due to the complexities introduced by hidden information. In this paper, we present Online Outcome Sampling (OOS), the first imperfect information search algorithm that is guaranteed to converge to an equilibrium strategy in two-player zero-sum games. We show that OOS avoids common problems encountered by existing search algorithms and we experimentally evaluate its convergence rate and practical performance against benchmark strategies in Liar's Dice and a variant of Goofspiel. We show that, unlike with Information Set Monte Carlo Tree Search (ISMCTS), the exploitability of the strategies produced by OOS decreases as the amount of search time increases. In practice, OOS performs as well as ISMCTS in head-to-head play while producing strategies with lower exploitability given the same search time.
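
As a hedged illustration of the bookkeeping any MCCFR-style method shares, the sketch below keeps per-information-set cumulative regrets and a weighted average strategy, the quantity that converges to equilibrium. How OOS computes the importance-weighted sampled regrets fed into `update` is the paper's contribution and is deliberately left abstract; the class and names are our own.

```python
# Sketch: per-information-set state for regret matching in MCCFR-style search.
from collections import defaultdict

class InfoSetNode:
    def __init__(self, n_actions):
        self.regret_sum = [0.0] * n_actions
        self.strategy_sum = [0.0] * n_actions

    def strategy(self):
        """Current strategy via regret matching (uniform if no positive regret)."""
        r_plus = [max(r, 0.0) for r in self.regret_sum]
        total = sum(r_plus)
        n = len(r_plus)
        return [r / total for r in r_plus] if total > 0 else [1.0 / n] * n

    def update(self, sampled_regrets, reach_weight):
        """Accumulate sampled regrets and weight the running average strategy."""
        sigma = self.strategy()
        for a, r in enumerate(sampled_regrets):
            self.regret_sum[a] += r
            self.strategy_sum[a] += reach_weight * sigma[a]

    def average_strategy(self):
        """The average strategy is what converges to equilibrium."""
        total = sum(self.strategy_sum)
        n = len(self.strategy_sum)
        return ([s / total for s in self.strategy_sum]
                if total > 0 else [1.0 / n] * n)

nodes = defaultdict(lambda: InfoSetNode(2))   # infoset key -> node
node = nodes["P1:raise-history"]              # hypothetical infoset key
node.update(sampled_regrets=[0.3, -0.3], reach_weight=1.0)
print(node.average_strategy())
```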


Using Response Functions to Measure Strategy Strength

AAAI Conferences

Extensive-form games are a powerful tool for representing complex multi-agent interactions. Nash equilibrium strategies are commonly used as a solution concept for extensive-form games, but many games are too large for the computation of Nash equilibria to be tractable. In these large games, exploitability has traditionally been used to measure deviation from Nash equilibrium, and thus strategies are designed to achieve minimal exploitability. However, while exploitability measures a strategy's worst-case performance, it fails to capture how likely that worst case is to be observed in practice. In fact, empirical evidence has shown that a less exploitable strategy can perform worse than a more exploitable strategy in one-on-one play against a variety of opponents. In this work, we propose a class of response functions that can be used to measure the strength of a strategy. We prove that standard no-regret algorithms can be used to learn optimal strategies for a scenario where the opponent uses one of these response functions. We demonstrate the effectiveness of this technique in Leduc Hold'em against opponents that use the UCT Monte Carlo tree search algorithm.
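
A minimal sketch of the learning setup, under toy assumptions: regret matching in a zero-sum matrix game against an opponent that plays a fixed response function of our current strategy, here a softened best response in matching pennies. The paper's response-function class and its games are richer; the payoff matrix and `soft_response` are stand-ins.

```python
# Sketch: a no-regret learner against a fixed response-function opponent.
import numpy as np

U = np.array([[1.0, -1.0], [-1.0, 1.0]])   # matching pennies payoffs (toy)

def soft_response(p, temp=1.0):
    """Opponent: softmax best response to our mixed strategy p."""
    z = -(U.T @ p) / temp                  # opponent minimizes our payoff
    e = np.exp(z - z.max())
    return e / e.sum()

regret = np.zeros(2)
avg = np.zeros(2)
for t in range(10000):
    r_plus = np.maximum(regret, 0.0)
    p = r_plus / r_plus.sum() if r_plus.sum() > 0 else np.full(2, 0.5)
    q = soft_response(p)                   # opponent applies its response fn
    payoffs = U @ q                        # expected payoff of each action
    regret += payoffs - p @ payoffs        # regret-matching update
    avg += p
print("learned strategy:", avg / avg.sum())
```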


Solving Imperfect Information Games Using Decomposition

AAAI Conferences

Decomposition, i.e., independently analyzing possible subgames, has proven to be an essential principle for effective decision-making in perfect information games. However, in imperfect information games, decomposition has proven problematic. To date, all proposed techniques for decomposition in imperfect information games have abandoned theoretical guarantees. This work presents the first technique for decomposing an imperfect information game into subgames that can be solved independently while retaining optimality guarantees on the full-game solution. We can use this technique to construct theoretically justified algorithms that make better use of information available at run-time, overcome memory or disk limitations at run-time, or make a time/space trade-off to overcome memory or disk limitations while solving a game. In particular, we present an algorithm for subgame solving that guarantees performance in the whole game, in contrast to existing methods which may have unbounded error. In addition, we present an offline game solving algorithm, CFR-D, which can produce a Nash equilibrium for a game that is larger than available storage.
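
The sketch below is only a guess at the shape of the data a decomposition method must carry across the trunk/subgame boundary, based on the abstract's description: reach probabilities over subgame root states plus opponent counterfactual values that constrain how the subgame may be re-solved. All field names and example values are hypothetical.

```python
# Sketch: the summary a trunk hands to an independent subgame solver.
from dataclasses import dataclass, field

@dataclass
class SubgameSummary:
    root_id: str
    # our probability of reaching each possible root state
    own_reach: dict[str, float] = field(default_factory=dict)
    # opponent counterfactual value at each of their root infosets;
    # a re-solved subgame strategy must not let the opponent exceed
    # these, which is what preserves the full-game guarantee
    opp_cfv: dict[str, float] = field(default_factory=dict)

# Example: summarizing a poker-like subgame root (toy values).
summary = SubgameSummary(
    root_id="river",
    own_reach={"AA": 0.4, "KQ": 0.6},
    opp_cfv={"opp_infoset_1": 0.25, "opp_infoset_2": -0.1},
)
print(summary)
```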