Goto

Collaborating Authors

 Undirected Networks


Markov Decision Processes with Ordinal Rewards: Reference Point-Based Preferences

AAAI Conferences

In a standard Markov decision process (MDP), rewards are assumed to be precisely known and of quantitative nature. This can be a too strong hypothesis in some situations. When rewards can really be modeled numerically, specifying the reward function is often difficult as it is a cognitively-demanding and/or time-consuming task. Besides, rewards can sometimes be of qualitative nature as when they represent qualitative risk levels for instance. In those cases, it is problematic to use directly standard MDPs and we propose instead to resort to MDPs with ordinal rewards. Only a total order over rewards is assumed to be known. In this setting, we explain how an alternative way to define expressive and interpretable preferences using reference points can be exploited.


Closing the Gap: Improved Bounds on Optimal POMDP Solutions

AAAI Conferences

POMDP algorithms have made significant progress in recent years by allowing practitioners to find good solutions to increasingly large problems. Most approaches (including point-based and policy iteration techniques) operate by refining a lower bound of the optimal value function. Several approaches (e.g., HSVI2, SARSOP, grid-based approaches and online forward search) also refine an upper bound. However, approximating the optimal value function by an upper bound is computationally expensive and therefore tightness is often sacrificed to improve efficiency (e.g., sawtooth approximation). In this paper, we describe a new approach to efficiently compute tighter bounds by i) conducting a prioritized breadth first search over the reachable beliefs, ii) propagating upper bound improvements with an augmented POMDP and iii) using exact linear programming (instead of the sawtooth approximation) for upper bound interpolation. As a result, we can represent the bounds more compactly and significantly reduce the gap between upper and lower bounds on several benchmark problems.


Distributed Control of Situated Assistance in Large Domains with Many Tasks

AAAI Conferences

This paper tackles the problem of building situated prompting and assistance systems for guiding a human with a cognitive disability through a large domain containing multiple tasks. This problem is challenging because the target population has difficulty maintaining goals, recalling necessary steps and recognizing objects and potential actions (affordances), and therefore may not appear to be acting rationally. Prompts or cues from an automated system can be very helpful in this regard, but the domain is inherently partially observable due to sensor noise and uncertain human behaviours, making the task of selecting an appropriate prompt very challenging. Prior work has shown how such automated assistance for a single task can be modeled as a partially observable Markov decision process (POMDP). In this paper, we generalise this to multiple tasks, and show how to build a scalable, distributed and hierarchical controller. We demonstrate the algorithm in a set of simulated domains and show it can perform as well as the full model in many cases, and can give solutions to large problems (over 10 15 states and 10 9 observations) for which the full model fails to find a policy.


Hybrid Value Iteration for POMDPs

AAAI Conferences

The Partially Observable Markov Decision Process (POMDP) provides a rich mathematical model for designing agents that have to formulate plans under uncertainty. The curses of dimensionality and history associated with solving POMDPs have lead to numerous refinements of the value iteration algorithm. Several exact methods with different pruning strategies have been devised, yet, limited scalability has lead research to focus on ways to approximate the optimal value function. One set of approximations relies on point-based value iteration, which maintains a fixed-size value function, and is typically executed offline. Another set of approximations relies on tree search, which explores the implicit tree defined by the value iteration equation, and is typically executed online. In this paper we present a hybrid value iteration algorithm that combines the benefits of point-based value iteration and tree search. Using our approach, a hybrid agent executes tree search online, and occasionally updates its offline-computed lower bound on the optimal value function, resulting in improved lookahead and higher obtained reward, while meeting real-time constraints. Thus, unlike other hybrid algorithms that use an invariant value function computed offline, our proposed scheme uses information from the real-time tree search process to reason whether to perform a point-based backup online. Keeping track of partial results obtained during online planning makes the computation of point-based backups less prohibitive. We report preliminary results that support our approach.


Learning a Tutorial Dialogue Policy for Delayed Feedback

AAAI Conferences

Creating natural language tutorial dialogue systems that realize effective strategies is a central challenge for intelligent tutoring systems research. Traditional approaches generally require large development time, do not generalize well across domains, and do not match the flexibility and natural language sophistication of human tutors. A promising approach that may offer several benefits is data-driven system development, in which a dialogue policy is learned from corpora of human tutorial dialogue. To date these learning approaches typically focus on optimizing the tutor’s choice of act, and do not explicitly model the instances in which the tutor chose not to act. This paper reports on a hidden Markov modeling (HMM) approach within human textual tutorial dialogue that explicitly represents the tutors’ choices not to intervene. The results show that an HMM that models tutor non-interventions predicts tutor moves significantly better than a model that does not explicitly represent the non-interventions. The findings have implications for automatically modeling tutorial strategies and for learning dialogue policies from corpora.


Optimizing Hidden Markov Models for Ocean Feature Detection

AAAI Conferences

Given the diversity and spatio-temporal scales of dynamic coastal processes, sampling is a challenging task for oceanographers. To meet this challenge new robotic platforms such as Autonomous Underwater Vehicle (AUV) are being increasingly used. For effective water sampling during a mission an AUV should be adaptive to its environment, which requires it to be able to identify these dynamic and episodic ocean features in-situ. We describe the use of Hidden Markov Models (HMM) as a feature detection model used onboard an AUV, an autonomous untethered robot. We show how to build an identification model from data collected during past missions. Then we show how the parameters of the HMM can be optimized using a Genetic Algorithm approach, from models trained with the Baum-Welch algorithm in the initial population.


Feedback Message Passing for Inference in Gaussian Graphical Models

arXiv.org Artificial Intelligence

While loopy belief propagation (LBP) performs reasonably well for inference in some Gaussian graphical models with cycles, its performance is unsatisfactory for many others. In particular for some models LBP does not converge, and in general when it does converge, the computed variances are incorrect (except for cycle-free graphs for which belief propagation (BP) is non-iterative and exact). In this paper we propose {\em feedback message passing} (FMP), a message-passing algorithm that makes use of a special set of vertices (called a {\em feedback vertex set} or {\em FVS}) whose removal results in a cycle-free graph. In FMP, standard BP is employed several times on the cycle-free subgraph excluding the FVS while a special message-passing scheme is used for the nodes in the FVS. The computational complexity of exact inference is $O(k^2n)$, where $k$ is the number of feedback nodes, and $n$ is the total number of nodes. When the size of the FVS is very large, FMP is intractable. Hence we propose {\em approximate FMP}, where a pseudo-FVS is used instead of an FVS, and where inference in the non-cycle-free graph obtained by removing the pseudo-FVS is carried out approximately using LBP. We show that, when approximate FMP converges, it yields exact means and variances on the pseudo-FVS and exact means throughout the remainder of the graph. We also provide theoretical results on the convergence and accuracy of approximate FMP. In particular, we prove error bounds on variance computation. Based on these theoretical results, we design efficient algorithms to select a pseudo-FVS of bounded size. The choice of the pseudo-FVS allows us to explicitly trade off between efficiency and accuracy. Experimental results show that using a pseudo-FVS of size no larger than $\log(n)$, this procedure converges much more often, more quickly, and provides more accurate results than LBP on the entire graph.


On approximation of smoothing probabilities for hidden Markov models

arXiv.org Machine Learning

We consider the smoothing probabilities of hidden Markov model (HMM). We show that under fairly general conditions for HMM, the exponential forgetting still holds, and the smoothing probabilities can be well approximated with the ones of double sided HMM. This makes it possible to use ergodic theorems. As an applications we consider the pointwise maximum a posteriori segmentation, and show that the corresponding risks converge.


Variational Bayes approach for model aggregation in unsupervised classification with Markovian dependency

arXiv.org Machine Learning

We consider a binary unsupervised classification problem where each observation is associated with an unobserved label that we want to retrieve. More precisely, we assume that there are two groups of observation: normal and abnormal. The `normal' observations are coming from a known distribution whereas the distribution of the `abnormal' observations is unknown. Several models have been developed to fit this unknown distribution. In this paper, we propose an alternative based on a mixture of Gaussian distributions. The inference is done within a variational Bayesian framework and our aim is to infer the posterior probability of belonging to the class of interest. To this end, it makes no sense to estimate the mixture component number since each mixture model provides more or less relevant information to the posterior probability estimation. By computing a weighted average (named aggregated estimator) over the model collection, Bayesian Model Averaging (BMA) is one way of combining models in order to account for information provided by each model. The aim is then the estimation of the weights and the posterior probability for one specific model. In this work, we derive optimal approximations of these quantities from the variational theory and propose other approximations of the weights. To perform our method, we consider that the data are dependent (Markovian dependency) and hence we consider a Hidden Markov Model. A simulation study is carried out to evaluate the accuracy of the estimates in terms of classification. We also present an application to the analysis of public health surveillance systems.


Mean-Variance Optimization in Markov Decision Processes

arXiv.org Artificial Intelligence

We consider finite horizon Markov decision processes under performance measures that involve both the mean and the variance of the cumulative reward. We show that either randomized or history-based policies can improve performance. We prove that the complexity of computing a policy that maximizes the mean reward under a variance constraint is NPhard for some cases, and strongly NPhard for others. We finally offer pseudopolynomial exact and approximation algorithms. The classical theory of Markov decision processes (MDPs) deals with the maximization of the cumulative (possibly discounted) expected reward, to be denoted by W. However, a risk-averse decision maker may be interested in additional distributional properties of W. In this paper, we focus on the case where the decision maker is interested in both the mean and the variance of the cumulative reward, and we explore the associated computational issues. Risk aversion in MDPs is of course an old subject. Problems of this type can be handled by state augmentation (e.g., Bertsekas, 1995), namely, by introducing an auxiliary state variable that keeps track of the cumulative past reward. In a few special cases, e.g., with an exponential utility function, state augmentation is unnecessary, and optimal policies can be found by solving a modified Bellman equation (Chung & Sobel, 1987).