Markov Models
Decision-Theoretic Planning: Structural Assumptions and Computational Leverage
Boutilier, C., Dean, T., Hanks, S.
Planning under uncertainty is a central problem in the study of automated sequential decision making, and has been addressed by researchers in many different fields, including AI planning, decision analysis, operations research, control theory and economics. While the assumptions and perspectives adopted in these areas often differ in substantial ways, many planning problems of interest to researchers in these fields can be modeled as Markov decision processes (MDPs) and analyzed using the techniques of decision theory. This paper presents an overview and synthesis of MDP-related methods, showing how they provide a unifying framework for modeling many classes of planning problems studied in AI. It also describes structural properties of MDPs that, when exhibited by particular classes of problems, can be exploited in the construction of optimal or approximately optimal policies or plans. Planning problems commonly possess structure in the reward and value functions used to describe performance criteria, in the functions used to describe state transitions and observations, and in the relationships among features used to describe states, actions, rewards, and observations. Specialized representations, and algorithms employing these representations, can achieve computational leverage by exploiting these various forms of structure. Certain AI techniques -- in particular those based on the use of structured, intensional representations -- can be viewed in this way. This paper surveys several types of representations for both classical and decision-theoretic planning problems, and planning algorithms that exploit these representations in a number of different ways to ease the computational burden of constructing policies or plans. It focuses primarily on abstraction, aggregation and decomposition techniques based on AI-style representations.
Mean field for Markov Decision Processes: from Discrete to Continuous Optimization
Gast, Nicolas, Gaujal, Bruno, Boudec, Jean-Yves Le
We study the convergence of Markov Decision Processes made of a large number of objects to optimization problems on ordinary differential equations (ODE). We show that the optimal reward of such a Markov Decision Process, satisfying a Bellman equation, converges to the solution of a continuous Hamilton-Jacobi-Bellman (HJB) equation based on the mean field approximation of the Markov Decision Process. We give bounds on the difference of the rewards, and a constructive algorithm for deriving an approximating solution to the Markov Decision Process from a solution of the HJB equations. We illustrate the method on three examples pertaining respectively to investment strategies, population dynamics control and scheduling in queues are developed. They are used to illustrate and justify the construction of the controlled ODE and to show the gain obtained by solving a continuous HJB equation rather than a large discrete Bellman equation.
Sample-Based Planning for Continuous Action Markov Decision Processes
Mansley, Chris (Rutgers University) | Weinstein, Ari (Rutgers University) | Littman, Michael (Rutgers University)
In this paper, we present a new algorithm that integrates recent advances in solving continuous bandit problems with sample-based rollout methods for planning in Markov Decision Processes (MDPs). Our algorithm, Hierarchical Optimistic Optimization applied to Trees (HOOT) addresses planning in continuous-action MDPs. Empirical results are given that show that the performance of our algorithm meets or exceeds that of a similar discrete action planner by eliminating the problem of manual discretization of the action space.
Markov Decision Processes with Ordinal Rewards: Reference Point-Based Preferences
In a standard Markov decision process (MDP), rewards are assumed to be precisely known and of quantitative nature. This can be a too strong hypothesis in some situations. When rewards can really be modeled numerically, specifying the reward function is often difficult as it is a cognitively-demanding and/or time-consuming task. Besides, rewards can sometimes be of qualitative nature as when they represent qualitative risk levels for instance. In those cases, it is problematic to use directly standard MDPs and we propose instead to resort to MDPs with ordinal rewards. Only a total order over rewards is assumed to be known. In this setting, we explain how an alternative way to define expressive and interpretable preferences using reference points can be exploited.
Closing the Gap: Improved Bounds on Optimal POMDP Solutions
Poupart, Pascal (University of Waterloo) | Kim, Kee-Eung (KAIST) | Kim, Dongho (KAIST)
POMDP algorithms have made significant progress in recent years by allowing practitioners to find good solutions to increasingly large problems. Most approaches (including point-based and policy iteration techniques) operate by refining a lower bound of the optimal value function. Several approaches (e.g., HSVI2, SARSOP, grid-based approaches and online forward search) also refine an upper bound. However, approximating the optimal value function by an upper bound is computationally expensive and therefore tightness is often sacrificed to improve efficiency (e.g., sawtooth approximation). In this paper, we describe a new approach to efficiently compute tighter bounds by i) conducting a prioritized breadth first search over the reachable beliefs, ii) propagating upper bound improvements with an augmented POMDP and iii) using exact linear programming (instead of the sawtooth approximation) for upper bound interpolation. As a result, we can represent the bounds more compactly and significantly reduce the gap between upper and lower bounds on several benchmark problems.
Distributed Control of Situated Assistance in Large Domains with Many Tasks
Hoey, Jesse (University of Waterloo) | Grzes, Marek (University of Waterloo)
This paper tackles the problem of building situated prompting and assistance systems for guiding a human with a cognitive disability through a large domain containing multiple tasks. This problem is challenging because the target population has difficulty maintaining goals, recalling necessary steps and recognizing objects and potential actions (affordances), and therefore may not appear to be acting rationally. Prompts or cues from an automated system can be very helpful in this regard, but the domain is inherently partially observable due to sensor noise and uncertain human behaviours, making the task of selecting an appropriate prompt very challenging. Prior work has shown how such automated assistance for a single task can be modeled as a partially observable Markov decision process (POMDP). In this paper, we generalise this to multiple tasks, and show how to build a scalable, distributed and hierarchical controller. We demonstrate the algorithm in a set of simulated domains and show it can perform as well as the full model in many cases, and can give solutions to large problems (over 10 15 states and 10 9 observations) for which the full model fails to find a policy.
Hybrid Value Iteration for POMDPs
Maniloff, Diego (Massachusetts Institute of Technology) | Gmytrasiewicz, Piotr (University of Illinois at Chicago)
The Partially Observable Markov Decision Process (POMDP) provides a rich mathematical model for designing agents that have to formulate plans under uncertainty. The curses of dimensionality and history associated with solving POMDPs have lead to numerous refinements of the value iteration algorithm. Several exact methods with different pruning strategies have been devised, yet, limited scalability has lead research to focus on ways to approximate the optimal value function. One set of approximations relies on point-based value iteration, which maintains a fixed-size value function, and is typically executed offline. Another set of approximations relies on tree search, which explores the implicit tree defined by the value iteration equation, and is typically executed online. In this paper we present a hybrid value iteration algorithm that combines the benefits of point-based value iteration and tree search. Using our approach, a hybrid agent executes tree search online, and occasionally updates its offline-computed lower bound on the optimal value function, resulting in improved lookahead and higher obtained reward, while meeting real-time constraints. Thus, unlike other hybrid algorithms that use an invariant value function computed offline, our proposed scheme uses information from the real-time tree search process to reason whether to perform a point-based backup online. Keeping track of partial results obtained during online planning makes the computation of point-based backups less prohibitive. We report preliminary results that support our approach.
Learning a Tutorial Dialogue Policy for Delayed Feedback
Boyer, Kristy Elizabeth (North Carolina State University) | Phillips, Robert (North Carolina State University and Applied Research Associates, Inc.) | Ha, Eun Young (North Carolina State University) | Wallis, Michael (North Carolina State University and Applied Research Associates, Inc.) | Vouk, Mladen (North Carolina State University) | Lester, James (North Carolina State University)
Creating natural language tutorial dialogue systems that realize effective strategies is a central challenge for intelligent tutoring systems research. Traditional approaches generally require large development time, do not generalize well across domains, and do not match the flexibility and natural language sophistication of human tutors. A promising approach that may offer several benefits is data-driven system development, in which a dialogue policy is learned from corpora of human tutorial dialogue. To date these learning approaches typically focus on optimizing the tutorโs choice of act, and do not explicitly model the instances in which the tutor chose not to act. This paper reports on a hidden Markov modeling (HMM) approach within human textual tutorial dialogue that explicitly represents the tutorsโ choices not to intervene. The results show that an HMM that models tutor non-interventions predicts tutor moves significantly better than a model that does not explicitly represent the non-interventions. The findings have implications for automatically modeling tutorial strategies and for learning dialogue policies from corpora.
Optimizing Hidden Markov Models for Ocean Feature Detection
Kumar, Sandeep (Indian Institute of Technology Madras) | Celorrio, Sergio Jimenez (Universisdad Carlos III de Madrid) | Py, Frederic (Monterey Bay Aquarium Research Institute) | Khemani, Deepak (Indian Institute of Technology Madras) | Rajan, Kanna (Monterey Bay Aquarium Research Institute)
Given the diversity and spatio-temporal scales of dynamic coastal processes, sampling is a challenging task for oceanographers. To meet this challenge new robotic platforms such as Autonomous Underwater Vehicle (AUV) are being increasingly used. For effective water sampling during a mission an AUV should be adaptive to its environment, which requires it to be able to identify these dynamic and episodic ocean features in-situ. We describe the use of Hidden Markov Models (HMM) as a feature detection model used onboard an AUV, an autonomous untethered robot. We show how to build an identification model from data collected during past missions. Then we show how the parameters of the HMM can be optimized using a Genetic Algorithm approach, from models trained with the Baum-Welch algorithm in the initial population.
Feedback Message Passing for Inference in Gaussian Graphical Models
Liu, Ying, Chandrasekaran, Venkat, Anandkumar, Animashree, Willsky, Alan S.
While loopy belief propagation (LBP) performs reasonably well for inference in some Gaussian graphical models with cycles, its performance is unsatisfactory for many others. In particular for some models LBP does not converge, and in general when it does converge, the computed variances are incorrect (except for cycle-free graphs for which belief propagation (BP) is non-iterative and exact). In this paper we propose {\em feedback message passing} (FMP), a message-passing algorithm that makes use of a special set of vertices (called a {\em feedback vertex set} or {\em FVS}) whose removal results in a cycle-free graph. In FMP, standard BP is employed several times on the cycle-free subgraph excluding the FVS while a special message-passing scheme is used for the nodes in the FVS. The computational complexity of exact inference is $O(k^2n)$, where $k$ is the number of feedback nodes, and $n$ is the total number of nodes. When the size of the FVS is very large, FMP is intractable. Hence we propose {\em approximate FMP}, where a pseudo-FVS is used instead of an FVS, and where inference in the non-cycle-free graph obtained by removing the pseudo-FVS is carried out approximately using LBP. We show that, when approximate FMP converges, it yields exact means and variances on the pseudo-FVS and exact means throughout the remainder of the graph. We also provide theoretical results on the convergence and accuracy of approximate FMP. In particular, we prove error bounds on variance computation. Based on these theoretical results, we design efficient algorithms to select a pseudo-FVS of bounded size. The choice of the pseudo-FVS allows us to explicitly trade off between efficiency and accuracy. Experimental results show that using a pseudo-FVS of size no larger than $\log(n)$, this procedure converges much more often, more quickly, and provides more accurate results than LBP on the entire graph.