Goto

Collaborating Authors

 observable markov decision process


Off-Policy Evaluation for Episodic Partially Observable Markov Decision Processes under Non-Parametric Models

Neural Information Processing Systems

We study the problem of off-policy evaluation (OPE) for episodic Partially Observable Markov Decision Processes (POMDPs) with continuous states. Motivated by the recently proposed proximal causal inference framework, we develop a non-parametric identification result for estimating the policy value via a sequence of so-called V-bridge functions with the help of time-dependent proxy variables. We then develop a fitted-Q-evaluation-type algorithm to estimate V-bridge functions recursively, where a non-parametric instrumental variable (NPIV) problem is solved at each step. By analyzing this challenging sequential NPIV estimation, we establish the finite-sample error bounds for estimating the V-bridge functions and accordingly that for evaluating the policy value, in terms of the sample size, length of horizon and so-called (local) measure of ill-posedness at each step. To the best of our knowledge, this is the first finite-sample error bound for OPE in POMDPs under non-parametric models.


Expert-Guided POMDP Learning for Data-Efficient Modeling in Healthcare

Locatelli, Marco, Hommersom, Arjen, Cerioli, Roberto Clemens, Besozzi, Daniela, Stella, Fabio

arXiv.org Artificial Intelligence

Learning the parameters of Partially Observable Markov Decision Processes (POMDPs) from limited data is a significant challenge. We introduce the Fuzzy MAP EM algorithm, a novel approach that incorporates expert knowledge into the parameter estimation process by enriching the Expectation Maximization (EM) framework with fuzzy pseudo-counts derived from an expert-defined fuzzy model. This integration naturally reformulates the problem as a Maximum A Posteriori (MAP) estimation, effectively guiding learning in environments with limited data. In synthetic medical simulations, our method consistently outperforms the standard EM algorithm under both low-data and high-noise conditions. Furthermore, a case study on Myasthenia Gravis illustrates the ability of the Fuzzy MAP EM algorithm to recover a clinically coherent POMDP, demonstrating its potential as a practical tool for data-efficient modeling in healthcare.


Multi-Environment POMDPs: Discrete Model Uncertainty Under Partial Observability

Bovy, Eline M., Probine, Caleb, Suilen, Marnix, Topcu, Ufuk, Jansen, Nils

arXiv.org Artificial Intelligence

Multi-environment POMDPs (ME-POMDPs) extend standard POMDPs with discrete model uncertainty. ME-POMDPs represent a finite set of POMDPs that share the same state, action, and observation spaces, but may arbitrarily vary in their transition, observation, and reward models. Such models arise, for instance, when multiple domain experts disagree on how to model a problem. The goal is to find a single policy that is robust against any choice of POMDP within the set, i.e., a policy that maximizes the worst-case reward across all POMDPs. We generalize and expand on existing work in the following way. First, we show that ME-POMDPs can be generalized to POMDPs with sets of initial beliefs, which we call adversarial-belief POMDPs (AB-POMDPs). Second, we show that any arbitrary ME-POMDP can be reduced to a ME-POMDP that only varies in its transition and reward functions or only in its observation and reward functions, while preserving (optimal) policies. We then devise exact and approximate (point-based) algorithms to compute robust policies for AB-POMDPs, and thus ME-POMDPs. We demonstrate that we can compute policies for standard POMDP benchmarks extended to the multi-environment setting.


Robust Finite-Memory Policy Gradients for Hidden-Model POMDPs

Galesloot, Maris F. L., Andriushchenko, Roman, Češka, Milan, Junges, Sebastian, Jansen, Nils

arXiv.org Artificial Intelligence

Partially observable Markov decision processes (POMDPs) model specific environments in sequential decision-making under uncertainty. Critically, optimal policies for POMDPs may not be robust against perturbations in the environment. Hidden-model POMDPs (HM-POMDPs) capture sets of different environment models, that is, POMDPs with a shared action and observation space. The intuition is that the true model is hidden among a set of potential models, and it is unknown which model will be the environment at execution time. A policy is robust for a given HM-POMDP if it achieves sufficient performance for each of its POMDPs. We compute such robust policies by combining two orthogonal techniques: (1) a deductive formal verification technique that supports tractable robust policy evaluation by computing a worst-case POMDP within the HM-POMDP, and (2) subgradient ascent to optimize the candidate policy for a worst-case POMDP. The empirical evaluation shows that, compared to various baselines, our approach (1) produces policies that are more robust and generalize better to unseen POMDPs, and (2) scales to HM-POMDPs that consist of over a hundred thousand environments.


Partially Observable Monte-Carlo Graph Search

You, Yang, Thomas, Vincent, Schutz, Alex, Skilton, Robert, Hawes, Nick, Buffet, Olivier

arXiv.org Artificial Intelligence

Currently, large partially observable Markov decision processes (POMDPs) are often solved by sampling-based online methods which interleave planning and execution phases. However, a pre-computed offline policy is more desirable in POMDP applications with time or energy constraints. But previous offline algorithms are not able to scale up to large POMDPs. In this article, we propose a new sampling-based algorithm, the partially observable Monte-Carlo graph search (POMCGS) to solve large POMDPs offline. Different from many online POMDP methods, which progressively develop a tree while performing (Monte-Carlo) simulations, POMCGS folds this search tree on the fly to construct a policy graph, so that computations can be drastically reduced, and users can analyze and validate the policy prior to embedding and executing it. Moreover, POMCGS, together with action progressive widening and observation clustering methods provided in this article, is able to address certain continuous POMDPs. Through experiments, we demonstrate that POMCGS can generate policies on the most challenging POMDPs, which cannot be computed by previous offline algorithms, and these policies' values are competitive compared with the state-of-the-art online POMDP algorithms.


POMDP-Based Trajectory Planning for On-Ramp Highway Merging

Kollarčík, Adam, Hanzálek, Zdeněk

arXiv.org Artificial Intelligence

This paper addresses the trajectory planning problem for automated vehicle on-ramp highway merging. To tackle this challenge, we extend our previous work on trajectory planning at unsignalized intersections using Partially Observable Markov Decision Processes (POMDPs). The method utilizes the Adaptive Belief Tree (ABT) algorithm, an approximate sampling-based approach to solve POMDPs efficiently. We outline the POMDP formulation process, beginning with discretizing the highway topology to reduce problem complexity. Additionally, we describe the dynamics and measurement models used to predict future states and establish the relationship between available noisy measurements and predictions. Building on our previous work, the dynamics model is expanded to account for lateral movements necessary for lane changes during the merging process. We also define the reward function, which serves as the primary mechanism for specifying the desired behavior of the automated vehicle, combining multiple goals such as avoiding collisions or maintaining appropriate velocity. Our simulation results, conducted on three scenarios based on real-life traffic data from German highways, demonstrate the method's ability to generate safe, collision-free, and efficient merging trajectories. This work shows the versatility of this POMDP-based approach in tackling various automated driving problems.


Parameter Adjustments in POMDP-Based Trajectory Planning for Unsignalized Intersections

Hanzálek, Adam Kollarčík adn Zdeněk

arXiv.org Artificial Intelligence

This paper investigates the problem of trajectory planning for autonomous vehicles at unsignalized intersections, specifically focusing on scenarios where the vehicle lacks the right of way and yet must cross safely. To address this issue, we have employed a method based on the Partially Observable Markov Decision Processes (POMDPs) framework designed for planning under uncertainty. The method utilizes the Adaptive Belief Tree (ABT) algorithm as an approximate solver for the POMDPs. We outline the POMDP formulation, beginning with discretizing the intersection's topology. Additionally, we present a dynamics model for the prediction of the evolving states of vehicles, such as their position and velocity. Using an observation model, we also describe the connection of those states with the imperfect (noisy) available measurements. Our results confirmed that the method is able to plan collision-free trajectories in a series of simulations utilizing real-world traffic data from aerial footage of two distinct intersections. Furthermore, we studied the impact of parameter adjustments of the ABT algorithm on the method's performance. This provides guidance in determining reasonable parameter settings, which is valuable for future method applications.


Off-Policy Evaluation for Episodic Partially Observable Markov Decision Processes under Non-Parametric Models

Neural Information Processing Systems

We study the problem of off-policy evaluation (OPE) for episodic Partially Observable Markov Decision Processes (POMDPs) with continuous states. Motivated by the recently proposed proximal causal inference framework, we develop a non-parametric identification result for estimating the policy value via a sequence of so-called V-bridge functions with the help of time-dependent proxy variables. We then develop a fitted-Q-evaluation-type algorithm to estimate V-bridge functions recursively, where a non-parametric instrumental variable (NPIV) problem is solved at each step. By analyzing this challenging sequential NPIV estimation, we establish the finite-sample error bounds for estimating the V-bridge functions and accordingly that for evaluating the policy value, in terms of the sample size, length of horizon and so-called (local) measure of ill-posedness at each step. To the best of our knowledge, this is the first finite-sample error bound for OPE in POMDPs under non-parametric models.


Pessimistic Iterative Planning for Robust POMDPs

Galesloot, Maris F. L., Suilen, Marnix, Simão, Thiago D., Carr, Steven, Spaan, Matthijs T. J., Topcu, Ufuk, Jansen, Nils

arXiv.org Artificial Intelligence

Robust partially observable Markov decision processes (robust POMDPs) extend classical POMDPs to handle additional uncertainty on the transition and observation probabilities via so-called uncertainty sets. Policies for robust POMDPs must not only be memory-based to account for partial observability but also robust against model uncertainty to account for the worst-case instances from the uncertainty sets. We propose the pessimistic iterative planning (PIP) framework, which finds robust memory-based policies for robust POMDPs. PIP alternates between two main steps: (1) selecting an adversarial (non-robust) POMDP via worst-case probability instances from the uncertainty sets; and (2) computing a finite-state controller (FSC) for this adversarial POMDP. We evaluate the performance of this FSC on the original robust POMDP and use this evaluation in step (1) to select the next adversarial POMDP. Within PIP, we propose the rFSCNet algorithm. In each iteration, rFSCNet finds an FSC through a recurrent neural network trained using supervision policies optimized for the adversarial POMDP. The empirical evaluation in four benchmark environments showcases improved robustness against a baseline method in an ablation study and competitive performance compared to a state-of-the-art robust POMDP solver.


Imprecise Probabilities Meet Partial Observability: Game Semantics for Robust POMDPs

Bovy, Eline M., Suilen, Marnix, Junges, Sebastian, Jansen, Nils

arXiv.org Artificial Intelligence

Partially observable Markov decision processes (POMDPs) rely on the key assumption that probability distributions are precisely known. Robust POMDPs (RPOMDPs) alleviate this concern by defining imprecise probabilities, referred to as uncertainty sets. While robust MDPs have been studied extensively, work on RPOMDPs is limited and primarily focuses on algorithmic solution methods. We expand the theoretical understanding of RPOMDPs by showing that 1) different assumptions on the uncertainty sets affect optimal policies and values; 2) RPOMDPs have a partially observable stochastic game (POSG) semantic; and 3) the same RPOMDP with different assumptions leads to semantically different POSGs and, thus, different policies and values. These novel semantics for RPOMDPS give access to results for the widely studied POSG model; concretely, we show the existence of a Nash equilibrium. Finally, we classify the existing RPOMDP literature using our semantics, clarifying under which uncertainty assumptions these existing works operate.