Human-in-the-Loop Synthesis for Partially Observable Markov Decision Processes Artificial Intelligence

We study planning problems where autonomous agents operate inside environments that are subject to uncertainties and not fully observable. Partially observable Markov decision processes (POMDPs) are a natural formal model to capture such problems. Because of the potentially huge or even infinite belief space in POMDPs, synthesis with safety guarantees is, in general, computationally intractable. We propose an approach that aims to circumvent this difficulty: in scenarios that can be partially or fully simulated in a virtual environment, we actively integrate a human user to control an agent. While the user repeatedly tries to safely guide the agent in the simulation, we collect data from the human input. Via behavior cloning, we translate the data into a strategy for the POMDP. The strategy resolves all nondeterminism and non-observability of the POMDP, resulting in a discrete-time Markov chain (MC). The efficient verification of this MC gives quantitative insights into the quality of the inferred human strategy by proving or disproving given system specifications. For the case that the quality of the strategy is not sufficient, we propose a refinement method using counterexamples presented to the human. Experiments show that by including humans into the POMDP verification loop we improve the state of the art by orders of magnitude in terms of scalability.

The Partially Observable Games We Play for Cyber Deception Artificial Intelligence

Progressively intricate cyber infiltration mechanisms have made conventional means of defense, such as firewalls and malware detectors, incompetent. These sophisticated infiltration mechanisms can study the defender's behavior, identify security caveats, and modify their actions adaptively. To tackle these security challenges, cyber-infrastructures require active defense techniques that incorporate cyber deception, in which the defender (deceiver) implements a strategy to mislead the infiltrator. To this end, we use a two-player partially observable stochastic game (POSG) framework, wherein the deceiver has full observability over the states of the POSG, and the infiltrator has partial observability. Then, the deception problem is to compute a strategy for the deceiver that minimizes the expected cost of deception against all strategies of the infiltrator. We first show that the underlying problem is a robust mixed-integer linear program, which is intractable to solve in general. Towards a scalable approach, we compute optimal finite-memory strategies for the infiltrator by a reduction to a series of synthesis problems for parametric Markov decision processes. We use these infiltration strategies to find robust strategies for the deceiver using mixed-integer linear programming. We illustrate the performance of our technique on a POSG model for network security. Our experiments demonstrate that the proposed approach handles scenarios considerably larger than those of the state-of-the-art methods.

POMDPs under Probabilistic Semantics Artificial Intelligence

We consider partially observable Markov decision processes (POMDPs) with limit-average payoff, where a reward value in the interval [0,1] is associated to every transition, and the payoff of an infinite path is the long-run average of the rewards. We consider two types of path constraints: (i) quantitative constraint defines the set of paths where the payoff is at least a given threshold lambda_1 in (0,1]; and (ii) qualitative constraint which is a special case of quantitative constraint with lambda_1=1. We consider the computation of the almost-sure winning set, where the controller needs to ensure that the path constraint is satisfied with probability 1. Our main results for qualitative path constraint are as follows: (i) the problem of deciding the existence of a finite-memory controller is EXPTIME-complete; and (ii) the problem of deciding the existence of an infinite-memory controller is undecidable. For quantitative path constraint we show that the problem of deciding the existence of a finite-memory controller is undecidable.

A Symbolic SAT-Based Algorithm for Almost-Sure Reachability with Small Strategies in POMDPs

AAAI Conferences

POMDPs are standard models for probabilistic planning problems, where an agent interacts with an uncertain environment. We study the problem of almost-sure reachability, where given a set of target states, the question is to decide whether there is a policy to ensure that the target set is reached with probability 1 (almost-surely). While in general the problem is EXPTIME-complete, in many practical cases policies with a small amount of memory suffice. Moreover, the existing solution to the problem is explicit, which first requires to construct explicitly an exponential reduction to a belief-support MDP. In this work, we first study the existence of observation-stationary strategies, which is NP-complete, and then small-memory strategies. We present a symbolic algorithm by an efficient encoding to SAT and using a SAT solver for the problem. We report experimental results demonstrating the scalability of our symbolic (SAT-based) approach.

Indefinite-Horizon Reachability in Goal-DEC-POMDPs

AAAI Conferences

DEC-POMDPs extend POMDPs to a multi-agent setting, where several agents operate in an uncertain environment independently to achieve a joint objective. DEC-POMDPs have been studied with finite-horizon and infinite-horizon discounted-sum objectives, and there exist solvers both for exact and approximate solutions. In this work we consider Goal-DEC-POMDPs, where given a set of target states, the objective is to ensure that the target set is reached with minimal cost.We consider the indefinite-horizon (infinite-horizon with either discounted-sum, or undiscounted-sum, where absorbing goal states have zero-cost) problem. We present a new and novel method to solve the problem that extends methods for finite-horizon DEC-POMDPs and the RTDP-Bel approach for POMDPs. We present experimental results on several examples, and show that our approach presents promising results.