Markov Models
Learning Geometrically-Constrained Hidden Markov Models for Robot Navigation: Bridging the Topological-Geometrical Gap
Such maps specify the topology of important landmarks and situations (states), and routes or transitions (arcs) between them. They are concerned less with the physical location of landmarks, and more with topological relationships between situations. Typically, they are less complex and support much more ecient planning than metric maps. Topological maps are built on lowerlevel abstractions that allow the robot to move along arcs (perhaps by wall-or road-following), to recognize properties of locations, and to distinguish signicant locations as states; they are exible in allowing a more general notion of state, possibly including information about the non-geometrical aspects of the robot's situation. There are two typical strategies for deriving topological maps: one is to learn the topological map directly; the other is to rst learn a geometric map, then to derive a topological model from it through some process of analysis. A nice example of the second approach is provided by Thrun and B--ucken (1996a, 1996b; Thrun, 1999), who use occupancy-grid techniques to build the initial map. This strategy is appropriate when the primary cues for decomposition and abstraction of the map are geometric. However, in many cases, the nodes of a topological map are dened in terms of other sensory data (e.g., labels on a door or whether or not the robot is holding a bagel). Learning a geometric map rst also relies on the odometric abilities of a robot; if they are weak and the space is large, it is very dicult to derive a consistent map.
Policy Recognition in the Abstract Hidden Markov Model
Bui, H. H., Venkatesh, S., West, G.
In this paper, we present a method for recognising an agent's behaviour in dynamic, noisy, uncertain domains, and across multiple levels of abstraction. We term this problem on-line plan recognition under uncertainty and view it generally as probabilistic inference on the stochastic process representing the execution of the agent's plan. Our contributions in this paper are twofold. In terms of probabilistic inference, we introduce the Abstract Hidden Markov Model (AHMM), a novel type of stochastic processes, provide its dynamic Bayesian network (DBN) structure and analyse the properties of this network. We then describe an application of the Rao-Blackwellised Particle Filter to the AHMM which allows us to construct an efficient, hybrid inference method for this model. In terms of plan recognition, we propose a novel plan recognition framework based on the AHMM as the plan execution model. The Rao-Blackwellised hybrid inference for AHMM can take advantage of the independence properties inherent in a model of plan execution, leading to an algorithm for online probabilistic plan recognition that scales well with the number of levels in the plan hierarchy. This illustrates that while stochastic models for plan execution can be complex, they exhibit special structures which, if exploited, can lead to efficient plan recognition algorithms. We demonstrate the usefulness of the AHMM framework via a behaviour recognition system in a complex spatial environment using distributed video surveillance data.
Experiments with Infinite-Horizon, Policy-Gradient Estimation
Bartlett, P. L., Baxter, J., Weaver, L.
In this paper, we present algorithms that perform gradient ascent of the average reward in a partially observable Markov decision process (POMDP). These algorithms are based on GPOMDP, an algorithm introduced in a companion paper (Baxter and Bartlett, this volume), which computes biased estimates of the performance gradient in POMDPs. The algorithm's chief advantages are that it uses only one free parameter beta, which has a natural interpretation in terms of bias-variance trade-off, it requires no knowledge of the underlying state, and it can be applied to infinite state, control and observation spaces. We show how the gradient estimates produced by GPOMDP can be used to perform gradient ascent, both with a traditional stochastic-gradient algorithm, and with an algorithm based on conjugate-gradients that utilizes gradient information to bracket maxima in line searches. Experimental results are presented illustrating both the theoretical results of (Baxter and Bartlett, this volume) on a toy problem, and practical aspects of the algorithms on a number of more realistic problems.
Infinite-Horizon Policy-Gradient Estimation
Gradient-based approaches to direct policy search in reinforcement learning have received much recent attention as a means to solve problems of partial observability and to avoid some of the problems associated with policy degradation in value-function methods. In this paper we introduce GPOMDP, a simulation-based algorithm for generating a biased estimate of the gradient of the average reward in Partially Observable Markov Decision Processes POMDPs controlled by parameterized stochastic policies. A similar algorithm was proposed by (Kimura et al. 1995). The algorithm's chief advantages are that it requires storage of only twice the number of policy parameters, uses one free beta (which has a natural interpretation in terms of bias-variance trade-off), and requires no knowledge of the underlying state. We prove convergence of GPOMDP, and show how the correct choice of the parameter beta is related to the mixing time of the controlled POMDP. We briefly describe extensions of GPOMDP to controlled Markov chains, continuous state, observation and control spaces, multiple-agents, higher-order derivatives, and a version for training stochastic policies with internal states. In a companion paper (Baxter et al., this volume) we show how the gradient estimates generated by GPOMDP can be used in both a traditional stochastic gradient algorithm and a conjugate-gradient procedure to find local optima of the average reward.
Restricted Collapsed Draw: Accurate Sampling for Hierarchical Chinese Restaurant Process Hidden Markov Models
Makino, Takaki, Takei, Shunsuke, Sato, Issei, Mochihashi, Daichi
We propose a restricted collapsed draw (RCD) sampler, a general Markov chain Monte Carlo sampler of simultaneous draws from a hierarchical Chinese restaurant process (HCRP) with restriction. Models that require simultaneous draws from a hierarchical Dirichlet process with restriction, such as infinite Hidden markov models (iHMM), were difficult to enjoy benefits of \markerg{the} HCRP due to combinatorial explosion in calculating distributions of coupled draws. By constructing a proposal of seating arrangements (partitioning) and stochastically accepts the proposal by the Metropolis-Hastings algorithm, the RCD sampler makes accurate sampling for complex combination of draws while retaining efficiency of HCRP representation. Based on the RCD sampler, we developed a series of sophisticated sampling algorithms for iHMMs, including blocked Gibbs sampling, beam sampling, and split-merge sampling, that outperformed conventional iHMM samplers in experiments
An Application of Reinforcement Learning to Dialogue Strategy Selection in a Spoken Dialogue System for Email
This paper describes a novel method by which a spoken dialogue system can learn to choose an optimal dialogue strategy from its experience interacting with human users. The method is based on a combination of reinforcement learning and performance modeling of spoken dialogue systems. The reinforcement learning component applies Q-learning (Watkins, 1989), while the performance modeling component applies the PARADISE evaluation framework (Walker et al., 1997) to learn the performance function (reward) used in reinforcement learning. We illustrate the method with a spoken dialogue system named ELVIS (EmaiL Voice Interactive System), that supports access to email over the phone. We conduct a set of experiments for training an optimal dialogue strategy on a corpus of 219 dialogues in which human users interact with ELVIS over the phone. We then test that strategy on a corpus of 18 dialogues. We show that ELVIS can learn to optimize its strategy selection for agent initiative, for reading messages, and for summarizing email folders.
Speeding Up the Convergence of Value Iteration in Partially Observable Markov Decision Processes
Partially observable Markov decision processes (POMDPs) have recently become popular among many AI researchers because they serve as a natural model for planning under uncertainty. Value iteration is a well-known algorithm for finding optimal policies for POMDPs. It typically takes a large number of iterations to converge. This paper proposes a method for accelerating the convergence of value iteration. The method has been evaluated on an array of benchmark problems and was found to be very effective: It enabled value iteration to converge after only a few iterations on all the test problems.
Nonapproximability Results for Partially Observable Markov Decision Processes
Goldsmith, J., Lusena, C., Mundhenk, M.
Here \unlikely" means \unless some complexity classes collapse," where the collapses considered are P NP, P PSPACE, or P EXP. Until or unless these collapses are shown to hold, any control-policy designer must choose between such performance guarantees and ecient computation. In this work, we show that uncertainty breeds uncertainty: In a controlled stochastic system with uncertainty (as modeled by a partially observable Markov decision process, for example), plans can be obtained eciently or with quality guarantees, but rarely both. Planning over stochastic domains with uncertainty is hard (in some cases PSPACEhard or even undecidable, see Papadimitriou & Tsitsiklis, 1987; Madani, Hanks, & Condon, 1999). Given that it is hard to nd an optimal plan or policy, it is natural to try to nd one that is \good enough". In the best of all possible worlds, this would mean having an algorithm that is guaranteed to be fast and to produce a policy that is reasonably close to the optimal policy.
Value-Function Approximations for Partially Observable Markov Decision Processes
Partially observable Markov decision processes (POMDPs) provide an elegant mathematical framework for modeling complex decision and planning problems in stochastic domains in which states of the system are observable only indirectly, via a set of imperfect or noisy observations. The modeling advantage of POMDPs, however, comes at a price -- exact methods for solving them are computationally very expensive and thus applicable in practice only to very simple problems. We focus on efficient approximation (heuristic) methods that attempt to alleviate the computational problem and trade off accuracy for speed. We have two objectives here. First, we survey various approximation methods, analyze their properties and relations and provide some new insights into their differences. Second, we present a number of new approximation methods and novel refinements of existing techniques. The theoretical results are supported by experiments on a problem from the agent navigation domain.
Committee-Based Sample Selection for Probabilistic Classifiers
Argamon-Engelson, S., Dagan, I.
In many real-world learning tasks, it is expensive to acquire a sufficient number of labeled examples for training. This paper investigates methods for reducing annotation cost by `sample selection'. In this approach, during training the learning program examines many unlabeled examples and selects for labeling only those that are most informative at each stage. This avoids redundantly labeling examples that contribute little new information. Our work follows on previous research on Query By Committee, extending the committee-based paradigm to the context of probabilistic classification. We describe a family of empirical methods for committee-based sample selection in probabilistic classification models, which evaluate the informativeness of an example by measuring the degree of disagreement between several model variants. These variants (the committee) are drawn randomly from a probability distribution conditioned by the training set labeled so far. The method was applied to the real-world natural language processing task of stochastic part-of-speech tagging. We find that all variants of the method achieve a significant reduction in annotation cost, although their computational efficiency differs. In particular, the simplest variant, a two member committee with no parameters to tune, gives excellent results. We also show that sample selection yields a significant reduction in the size of the model used by the tagger.