This paper investigates methods for estimating the optimal stochastic control policy for a Markov Decision Process with unknown transition dynamics and an unknown reward function. This form of model-free reinforcement learning comprises many real world systems such as playing video games, simulated control tasks, and real robot locomotion. Existing methods for estimating the optimal stochastic control policy rely on high variance estimates of the policy descent. However, these methods are not guaranteed to find the optimal stochastic policy, and the high variance gradient estimates make convergence unstable. In order to resolve these problems, we propose a technique using Markov Chain Monte Carlo to generate samples from the posterior distribution of the parameters conditioned on being optimal. Our method provably converges to the globally optimal stochastic policy, and empirically similar variance compared to the policy gradient.
We present an algorithm that identifies the reasoning patterns of agents in a game, by iteratively examining the graph structure of its Multi-Agent Influence Diagram (MAID) representation. If the decision of an agent participates in no reasoning patterns, then we can effectively ignore that decision for the purpose of calculating a Nash equilibrium for the game. In some cases, this can lead to exponential time savings in the process of equilibrium calculation. Moreover, our algorithm can be used to enumerate the reasoning patterns in a game, which can be useful for constructing more effective computerized agents interacting with humans.
We present a theoretical framework for online probabilistic plan recognition in cooperative multiagent systems. Our model extends the Abstract Hidden Markov Model (AHMM) (Bui, Venkatesh, & West 2002), and consists of a hierarchical dynamic Bayes network that allows reasoning about the interaction among multiple cooperating agents.
For many board and card games, computers have at least matched humans in playing skill. An exception is the game of poker, offering new research challenges. The complexity of the game is threefold, namely poker is (1) an imperfect information game, with (2) stochastic outcomes in (3) an adversarial multi-agent environment. One promising approach used for AI poker players applies an adaptive imperfect information game-tree search algorithm to decide which actions to take based on expected value (EV) estimates (Billings et al. 2006). This technique (and related simulation algorithms) require two estimations of opponent information to accurately compute the EV, namely a prediction of the opponent's outcome of the game and prediction of opponent actions. Therefore learning an opponent model is imperative and this model should include the possibility of using relational features for the game-state and -history. In this paper we consider a relational Bayesian approach that uses a general prior (for outcomes and actions) and learns a relational regression tree to adapt that prior to individual players. Using a prior will both allow us to make reasonable predictions from the start and adapt to individual opponents more quickly as long as the choice of prior is reasonable.
Leonardo Garrido Ram6n Brena Centro de Inteligencia Artificial, Tecnol6gico de Monterrey Abstract This paper presents recent results of our experimental work in quantifying exactly how useful is building models about other agents using no more than the observation of others' behavior. The testbed we used in our experiments is an abstraction of the meeting scheduling problem, called the Meeting Scheduling Game, which has competitive as well as cooperative features. The agents are selfish, and use a rational decision theoretic approach based on the probabilistic models that the agent is learning. We view agent modeling as an iterative and gradual process, where every new piece of information about a particular agent is analyzed in such a way that the model of the agent is further refined. We propose a framework for measuring the performance of different modelling strategies and establish quantified lower and upper limits for the performance of any modeling strategy. Finally, we contrast the performances of a modeler from an individual and from a collective point of view, comparing the benefits for the modeler itself as well as for the group as a whole. Introduction Katia Sycara The Robotics Institute, Carnegie Mellon University Several approaches in the field of multiagent systems (MAS) (Durfee 1991; Wooldridge & Jennings 1995) make heavy use of beliefs as an internal model of the world (Bratman 1987) One form of belief of particular importance in multiagent systems are the agent's beliefs about other agents (Vidal & Durfee 1997b). This kind of belief could come from preexisting knowledge base (a kind of"prejudice"), or could be inferred from observing others' behavior. The purpuse of a modelling activity could be to benefit a specific agent, in the case of "selfish" agents, or to improve the performance of a group as a whole, in the case of cooperative agents -or even a combination of both.