Markov Models
Robust Finite-State Controllers for Uncertain POMDPs
Cubuktepe, Murat, Jansen, Nils, Junges, Sebastian, Marandi, Ahmadreza, Suilen, Marnix, Topcu, Ufuk
Uncertain partially observable Markov decision processes (uPOMDPs) allow the probabilistic transition and observation functions of standard POMDPs to belong to a so-called uncertainty set. Such uncertainty sets capture uncountable sets of probability distributions. We develop an algorithm to compute finite-memory policies for uPOMDPs that robustly satisfy given specifications against any admissible distribution. In general, computing such policies is both theoretically and practically intractable. We provide an efficient solution to this problem in four steps. (1) We state the underlying problem as a nonconvex optimization problem with infinitely many constraints. (2) A dedicated dualization scheme yields a dual problem that is still nonconvex but has finitely many constraints. (3) We linearize this dual problem and (4) solve the resulting finite linear program to obtain locally optimal solutions to the original problem. The resulting problem formulation is exponentially smaller than those resulting from existing methods. We demonstrate the applicability of our algorithm using large instances of an aircraft collision-avoidance scenario and a novel spacecraft motion planning case study.
CertRL: Formalizing Convergence Proofs for Value and Policy Iteration in Coq
Vajjha, Koundinya, Shinnar, Avraham, Pestun, Vasily, Trager, Barry, Fulton, Nathan
Reinforcement learning algorithms solve sequential decision-making problems in probabilistic environments by optimizing for long-term reward. The desire to use reinforcement learning in safety-critical settings inspires a recent line of work on formally constrained reinforcement learning; however, these methods place the implementation of the learning algorithm in their Trusted Computing Base. The crucial correctness property of these implementations is a guarantee that the learning algorithm converges to an optimal policy. This paper begins the work of closing this gap by developing a Coq formalization of two canonical reinforcement learning algorithms: value and policy iteration for finite state Markov decision processes. The central results are a formalization of Bellman's optimality principle and its proof, which uses a contraction property of Bellman optimality operator to establish that a sequence converges in the infinite horizon limit. The CertRL development exemplifies how the Giry monad and mechanized metric coinduction streamline optimality proofs for reinforcement learning algorithms. The CertRL library provides a general framework for proving properties about Markov decision processes and reinforcement learning algorithms, paving the way for further work on formalization of reinforcement learning algorithms.
Visual Methods for Sign Language Recognition: A Modality-Based Review
Seddik, Bassem, Amara, Najoua Essoukri Ben
Sign language visual recognition from continuous multi-modal streams is still one of the most challenging fields. Recent advances in human actions recognition are exploiting the ascension of GPU-based learning from massive data, and are getting closer to human-like performances. They are then prone to creating interactive services for the deaf and hearing-impaired communities. A population that is expected to grow considerably in the years to come. This paper aims at reviewing the human actions recognition literature with the sign-language visual understanding as a scope. The methods analyzed will be mainly organized according to the different types of unimodal inputs exploited, their relative multi-modal combinations and pipeline steps. In each section, we will detail and compare the related datasets, approaches then distinguish the still open contribution paths suitable for the creation of sign language related services. Special attention will be paid to the approaches and commercial solutions handling facial expressions and continuous signing.
LTLf Synthesis on Probabilistic Systems
Wells, Andrew M., Lahijanian, Morteza, Kavraki, Lydia E., Vardi, Moshe Y.
Many systems are naturally modeled as Markov Decision Processes (MDPs), combining probabilities and strategic actions. Given a model of a system as an MDP and some logical specification of system behavior, the goal of synthesis is to find a policy that maximizes the probability of achieving this behavior. A popular choice for defining behaviors is Linear Temporal Logic (LTL). Policy synthesis on MDPs for properties specified in LTL has been well studied. LTL, however, is defined over infinite traces, while many properties of interest are inherently finite. Linear Temporal Logic over finite traces (LTLf) has been used to express such properties, but no tools exist to solve policy synthesis for MDP behaviors given finite-trace properties. We present two algorithms for solving this synthesis problem: the first via reduction of LTLf to LTL and the second using native tools for LTLf. We compare the scalability of these two approaches for synthesis and show that the native approach offers better scalability compared to existing automaton generation tools for LTL.
The relationship between dynamic programming and active inference: the discrete, finite-horizon case
Da Costa, Lancelot, Sajid, Noor, Parr, Thomas, Friston, Karl, Smith, Ryan
Active inference is a normative framework for generating behaviour based upon the free energy principle, a theory of self-organisation. This framework has been successfully used to solve reinforcement learning and stochastic control problems, yet, the formal relation between active inference and reward maximisation has not been fully explicated. In this paper, we consider the relation between active inference and dynamic programming under the Bellman equation, which underlies many approaches to reinforcement learning and control. We show that, on partially observable Markov decision processes, dynamic programming is a limiting case of active inference. In active inference, agents select actions to minimise expected free energy. In the absence of ambiguity about states, this reduces to matching expected states with a target distribution encoding the agent's preferences. When target states correspond to rewarding states, this maximises expected reward, as in reinforcement learning. When states are ambiguous, active inference agents will choose actions that simultaneously minimise ambiguity. This allows active inference agents to supplement their reward maximising (or exploitative) behaviour with novelty-seeking (or exploratory) behaviour. This clarifies the connection between active inference and reinforcement learning, and how both frameworks may benefit from each other.
Learning with Safety Constraints: Sample Complexity of Reinforcement Learning for Constrained MDPs
HasanzadeZonuzy, Aria, Bura, Archana, Kalathil, Dileep, Shakkottai, Srinivas
Many physical systems have underlying safety considerations that require that the policy employed ensures the satisfaction of a set of constraints. The analytical formulation usually takes the form of a Constrained Markov Decision Process (CMDP). We focus on the case where the CMDP is unknown, and RL algorithms obtain samples to discover the model and compute an optimal constrained policy. Our goal is to characterize the relationship between safety constraints and the number of samples needed to ensure a desired level of accuracy---both objective maximization and constraint satisfaction---in a PAC sense. We explore two classes of RL algorithms, namely, (i) a generative model based approach, wherein samples are taken initially to estimate a model, and (ii) an online approach, wherein the model is updated as samples are obtained. Our main finding is that compared to the best known bounds of the unconstrained regime, the sample complexity of constrained RL algorithms are increased by a factor that is logarithmic in the number of constraints, which suggests that the approach may be easily utilized in real systems.
A Derivative-free Method for Quantum Perceptron Training in Multi-layered Neural Networks
Khan, Tariq M., Robles-Kelly, Antonio
In this paper, we present a gradient-free approach for training multi-layered neural networks based upon quantum perceptrons. Here, we depart from the classical perceptron and the elemental operations on quantum bits, i.e. qubits, so as to formulate the problem in terms of quantum perceptrons. We then make use of measurable operators to define the states of the network in a manner consistent with a Markov process. This yields a Dirac-Von Neumann formulation consistent with quantum mechanics. Moreover, the formulation presented here has the advantage of having a computational efficiency devoid of the number of layers in the network. This, paired with the natural efficiency of quantum computing, can imply a significant improvement in efficiency, particularly for deep networks. Finally, but not least, the developments here are quite general in nature since the approach presented here can also be used for quantum-inspired neural networks implemented on conventional computers.
Exploiting Submodular Value Functions For Scaling Up Active Perception
Satsangi, Yash, Whiteson, Shimon, Oliehoek, Frans A., Spaan, Matthijs T. J.
In active perception tasks, an agent aims to select sensory actions that reduce its uncertainty about one or more hidden variables. While partially observable Markov decision processes (POMDPs) provide a natural model for such problems, reward functions that directly penalize uncertainty in the agent's belief can remove the piecewise-linear and convex property of the value function required by most POMDP planners. Furthermore, as the number of sensors available to the agent grows, the computational cost of POMDP planning grows exponentially with it, making POMDP planning infeasible with traditional methods. In this article, we address a twofold challenge of modeling and planning for active perception tasks. We show the mathematical equivalence of $\rho$POMDP and POMDP-IR, two frameworks for modeling active perception tasks, that restore the PWLC property of the value function. To efficiently plan for active perception tasks, we identify and exploit the independence properties of POMDP-IR to reduce the computational cost of solving POMDP-IR (and $\rho$POMDP). We propose greedy point-based value iteration (PBVI), a new POMDP planning method that uses greedy maximization to greatly improve scalability in the action space of an active perception POMDP. Furthermore, we show that, under certain conditions, including submodularity, the value function computed using greedy PBVI is guaranteed to have bounded error with respect to the optimal value function. We establish the conditions under which the value function of an active perception POMDP is guaranteed to be submodular. Finally, we present a detailed empirical analysis on a dataset collected from a multi-camera tracking system employed in a shopping mall. Our method achieves similar performance to existing methods but at a fraction of the computational cost leading to better scalability for solving active perception tasks.
Reinforcement Learning Approaches in Social Robotics
In order to facilitate natural interaction, researchers in social robotics have focused on robots that can adapt to diverse conditions and to the different users with whom they interact. Recently, there has been great interest in the use of machine learning methods for adaptive social robots [48], [29], [106], [45], [49], [86]. Machine Learning (ML) algorithms can be categorized into three subfields [2]: supervised learning, unsupervised learning and reinforcement learning. In supervised learning, correct input/output pairs are available and the goal is to find a correct mapping from input to output space. In unsupervised learning, output data is not available and the goal is to find patterns in the input data. Reinforcement Learning (RL) [96] is a framework for decision-making problems in which an agent interacts through trial-and-error with its environment to discover an optimal behavior. The agent does not receive direct feedback of correctness, instead it receives scarce feedback about the actions it has taken in the past.
An Improved Approach of Intention Discovery with Machine Learning for POMDP-based Dialogue Management
An Embodied Conversational Agent (ECA) is an intelligent agent that works as the front end of software applications to interact with users through verbal/nonverbal expressions and to provide online assistance without the limits of time, location, and language. To help to improve the experience of human-computer interaction, there is an increasing need to empower ECA with not only the realistic look of its human counterparts but also a higher level of intelligence. This thesis first highlights the main topics related to the construction of ECA, including different approaches of dialogue management, and then discusses existing techniques of trend analysis for its application in user classification. As a further refinement and enhancement to prior work on ECA, this thesis research proposes a cohesive framework to integrate emotion-based facial animation with improved intention discovery. In addition, a machine learning technique is introduced to support sentiment analysis for the adjustment of policy design in POMDP-based dialogue management. The proposed research work is going to improve the accuracy of intention discovery while reducing the length of dialogues.