This paper investigates methods for estimating the optimal stochastic control policy for a Markov Decision Process with unknown transition dynamics and an unknown reward function. This form of model-free reinforcement learning covers many real-world problems, such as playing video games, simulated control tasks, and real robot locomotion. Existing methods for estimating the optimal stochastic control policy rely on high-variance estimates of the policy gradient. However, these methods are not guaranteed to find the optimal stochastic policy, and the high-variance gradient estimates make convergence unstable. To resolve these problems, we propose a technique that uses Markov Chain Monte Carlo to generate samples from the posterior distribution of the parameters conditioned on being optimal. Our method provably converges to the globally optimal stochastic policy and empirically exhibits variance similar to that of the policy gradient.
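As a rough illustration of sampling policy parameters with Markov Chain Monte Carlo, the sketch below runs a random-walk Metropolis-Hastings chain on a toy problem. Everything specific here is an assumption made for illustration and is not taken from the paper: the two-armed bandit objective `expected_return`, the Gaussian prior, the form of `log_target` (log prior plus return divided by a temperature), and all parameter values.

```python
import math
import random


def expected_return(theta):
    """Hypothetical toy objective: a two-armed bandit with rewards 1.0 and 0.0.

    The policy picks the rewarding arm with probability sigmoid(theta).
    """
    p = 1.0 / (1.0 + math.exp(-theta))
    return p * 1.0 + (1.0 - p) * 0.0


def log_target(theta, temperature=0.1, prior_std=5.0):
    """Assumed unnormalized log density: Gaussian log prior plus return / temperature."""
    log_prior = -0.5 * (theta / prior_std) ** 2
    return log_prior + expected_return(theta) / temperature


def metropolis_hastings(n_samples, step=1.0, seed=0):
    """Random-walk Metropolis-Hastings over the scalar policy parameter theta."""
    rng = random.Random(seed)
    theta = 0.0
    current = log_target(theta)
    samples = []
    for _ in range(n_samples):
        proposal = theta + rng.gauss(0.0, step)
        candidate = log_target(proposal)
        # Accept with probability min(1, exp(candidate - current)).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            theta, current = proposal, candidate
        samples.append(theta)
    return samples


samples = metropolis_hastings(5000)
posterior_mean = sum(samples[1000:]) / len(samples[1000:])
```

In this toy setting the chain concentrates on positive values of `theta`, that is, on policies that favor the rewarding arm. Lowering the assumed `temperature` sharpens the target around high-return parameters.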
Nguyen, Truong-Huy Dinh (National University of Singapore) | Hsu, David (National University of Singapore) | Lee, Wee-Sun (National University of Singapore) | Leong, Tze-Yun (National University of Singapore) | Kaelbling, Leslie Pack (Massachusetts Institute of Technology) | Lozano-Perez, Tomas (Massachusetts Institute of Technology) | Grant, Andrew Haydn (Singapore-MIT GAMBIT Game Lab)
We apply decision-theoretic techniques to construct non-player characters that are able to assist a human player in collaborative games. The method is based on solving Markov decision processes, which can be difficult when the game state is described by many variables. To scale to more complex games, the method allows decomposition of a game task into subtasks, each of which can be modelled by a Markov decision process. Intention recognition is used to infer the subtask that the human is currently performing, allowing the helper to assist the human in performing the correct task. Experiments show that the method can be effective, giving near-human-level performance in helping a human in a collaborative game.
Summary: I describe how the TrueSkill algorithm works using concepts you're already familiar with. TrueSkill is used on Xbox Live to rank and match players, and it serves as a great way to understand how statistical machine learning is actually applied today. I've also created an open source project where I implemented TrueSkill three different times in increasing complexity and capability. In addition, I've created a detailed supplemental math paper that works out equations that I gloss over here. Feel free to jump to sections that look interesting and ignore ones that seem boring. Don't worry if this post seems a bit long; there are lots of pictures. It seemed easy enough: I wanted to create a database to track the skill levels of my coworkers in chess and foosball. I already knew that I wasn't very good at foosball and would bring down better players. I was curious if an algorithm could do a better job at creating well-balanced matches. I also wanted to see if I was improving at chess. I knew I needed to have an easy way to collect results from everyone and then use an algorithm that would keep getting better with more data. I was looking for a way to compress all that data and distill it down to some simple knowledge of how skilled people are. Based on some previous things that I had heard about, this seemed like a good fit for "machine learning." Machine learning is a hot area in Computer Science, but it's intimidating. Like most subjects, there's a lot to learn to be an expert in the field. I didn't need to go very deep; I just needed to understand enough to solve my problem. I found a link to the paper describing the TrueSkill algorithm and I read it several times, but it didn't make sense. It was only 8 pages long, but it seemed beyond my capability to understand.
For many board and card games, computers have at least matched humans in playing skill. An exception is the game of poker, which offers new research challenges. The complexity of the game is threefold: poker is (1) an imperfect information game, with (2) stochastic outcomes in (3) an adversarial multi-agent environment. One promising approach used for AI poker players applies an adaptive imperfect information game-tree search algorithm to decide which actions to take based on expected value (EV) estimates (Billings et al. 2006). This technique (and related simulation algorithms) requires two estimations of opponent information to accurately compute the EV, namely a prediction of the opponent's outcome of the game and a prediction of the opponent's actions. Therefore, learning an opponent model is imperative, and this model should include the possibility of using relational features for the game-state and -history. In this paper we consider a relational Bayesian approach that uses a general prior (for outcomes and actions) and learns a relational regression tree to adapt that prior to individual players. Using a prior will allow us both to make reasonable predictions from the start and to adapt to individual opponents more quickly, as long as the choice of prior is reasonable.
Eric B. Baum, NEC Research Institute, 4 Independence Way, Princeton NJ 08540, eric@research.NJ.NEC.COM. Abstract: The point of game tree search is to insulate oneself from errors in the evaluation function. The standard approach is to grow a full-width tree as deep as time allows, and then value the tree as if the leaf evaluations were exact. This has been effective in many games because of the computational efficiency of the alpha-beta algorithm. A Bayesian would suggest instead to train a model of one's uncertainty. This model adds extra information in addition to the standard evaluation function. Within such a formal model, there is an optimal tree growth procedure and an optimal method of valuing the tree. We describe how to optimally value the tree, and how to approximate online the optimal tree to search.