Markov Models
Partially Observable Markov Decision Processes in Robotics: A Survey
Lauri, Mikko, Hsu, David, Pajarinen, Joni
Noisy sensing, imperfect control, and environment changes are defining characteristics of many real-world robot tasks. The partially observable Markov decision process (POMDP) provides a principled mathematical framework for modeling and solving robot decision and control tasks under uncertainty. Over the last decade, it has seen many successful applications, spanning localization and navigation, search and tracking, autonomous driving, multi-robot systems, manipulation, and human-robot interaction. This survey aims to bridge the gap between the development of POMDP models and algorithms at one end and application to diverse robot decision tasks at the other. It analyzes the characteristics of these tasks and connects them with the mathematical and algorithmic properties of the POMDP framework for effective modeling and solution. For practitioners, the survey provides some of the key task characteristics in deciding when and how to apply POMDPs to robot tasks successfully. For POMDP algorithm designers, the survey provides new insights into the unique challenges of applying POMDPs to robot systems and points to promising new directions for further research.
LCRL: Certified Policy Synthesis via Logically-Constrained Reinforcement Learning
Hasanbeig, Hosein, Kroening, Daniel, Abate, Alessandro
LCRL is a software tool that implements model-free Reinforcement Learning (RL) algorithms over unknown Markov Decision Processes (MDPs), synthesising policies that satisfy a given linear temporal specification with maximal probability. LCRL leverages partially deterministic finite-state machines known as Limit Deterministic Buchi Automata (LDBA) to express a given linear temporal specification. A reward function for the RL algorithm is shaped on-the-fly, based on the structure of the LDBA. Theoretical guarantees under proper assumptions ensure the convergence of the RL algorithm to an optimal policy that maximises the satisfaction probability. We present case studies to demonstrate the applicability, ease of use, scalability, and performance of LCRL. Owing to the LDBA-guided exploration and LCRL model-free architecture, we observe robust performance, which also scales well when compared to standard RL approaches (whenever applicable to LTL specifications). Full instructions on how to execute all the case studies in this paper are provided on a GitHub page that accompanies the LCRL distribution www.github.com/grockious/lcrl.
Off-Policy Risk Assessment in Markov Decision Processes
Huang, Audrey, Leqi, Liu, Lipton, Zachary Chase, Azizzadenesheli, Kamyar
Addressing such diverse ends as safety alignment with human preferences, and the efficiency of learning, a growing line of reinforcement learning research focuses on risk functionals that depend on the entire distribution of returns. Recent work on \emph{off-policy risk assessment} (OPRA) for contextual bandits introduced consistent estimators for the target policy's CDF of returns along with finite sample guarantees that extend to (and hold simultaneously over) all risk. In this paper, we lift OPRA to Markov decision processes (MDPs), where importance sampling (IS) CDF estimators suffer high variance on longer trajectories due to small effective sample size. To mitigate these problems, we incorporate model-based estimation to develop the first doubly robust (DR) estimator for the CDF of returns in MDPs. This estimator enjoys significantly less variance and, when the model is well specified, achieves the Cramer-Rao variance lower bound. Moreover, for many risk functionals, the downstream estimates enjoy both lower bias and lower variance. Additionally, we derive the first minimax lower bounds for off-policy CDF and risk estimation, which match our error bounds up to a constant factor. Finally, we demonstrate the precision of our DR CDF estimates experimentally on several different environments.
Bayesian regularization of empirical MDPs
Gupta, Samarth, Hill, Daniel N., Ying, Lexing, Dhillon, Inderjit
In most applications of model-based Markov decision processes, the parameters for the unknown underlying model are often estimated from the empirical data. Due to noise, the policy learnedfrom the estimated model is often far from the optimal policy of the underlying model. When applied to the environment of the underlying model, the learned policy results in suboptimal performance, thus calling for solutions with better generalization performance. In this work we take a Bayesian perspective and regularize the objective function of the Markov decision process with prior information in order to obtain more robust policies. Two approaches are proposed, one based on $L^1$ regularization and the other on relative entropic regularization. We evaluate our proposed algorithms on synthetic simulations and on real-world search logs of a large scale online shopping store. Our results demonstrate the robustness of regularized MDP policies against the noise present in the models.
Cautious Planning with Incremental Symbolic Perception: Designing Verified Reactive Driving Maneuvers
Kamale, Disha, Haesaert, Sofie, Vasile, Cristian-Ioan
This work presents a step towards utilizing incrementally-improving symbolic perception knowledge of the robot's surroundings for provably correct reactive control synthesis applied to an autonomous driving problem. Combining abstract models of motion control and information gathering, we show that assume-guarantee specifications (a subclass of Linear Temporal Logic) can be used to define and resolve traffic rules for cautious planning. We propose a novel representation called symbolic refinement tree for perception that captures the incremental knowledge about the environment and embodies the relationships between various symbolic perception inputs. The incremental knowledge is leveraged for synthesizing verified reactive plans for the robot. The case studies demonstrate the efficacy of the proposed approach in synthesizing control inputs even in case of partially occluded environments.
Rethinking Individual Global Max in Cooperative Multi-Agent Reinforcement Learning
Hong, Yitian, Jin, Yaochu, Tang, Yang
In cooperative multi-agent reinforcement learning, centralized training and decentralized execution (CTDE) has achieved remarkable success. Individual Global Max (IGM) decomposition, which is an important element of CTDE, measures the consistency between local and joint policies. The majority of IGM-based research focuses on how to establish this consistent relationship, but little attention has been paid to examining IGM's potential flaws. In this work, we reveal that the IGM condition is a lossy decomposition, and the error of lossy decomposition will accumulated in hypernetwork-based methods. To address the above issue, we propose to adopt an imitation learning strategy to separate the lossy decomposition from Bellman iterations, thereby avoiding error accumulation. The proposed strategy is theoretically proved and empirically verified on the StarCraft Multi-Agent Challenge benchmark problem with zero sight view. The results also confirm that the proposed method outperforms state-of-the-art IGM-based approaches.
An information-theoretic perspective on intrinsic motivation in reinforcement learning: a survey
Aubret, Arthur, Matignon, Laetitia, Hassas, Salima
Traditionally, an agent maximizes a reward defined according to the task to perform: it may be a score when the agent learns to solve a game or a distance function when the agent learns to reach a goal. The reward is then considered as extrinsic (or as a feedback) because the reward function is provided expertly and specifically for the task. With an extrinsic reward, many spectacular results have been obtained on Atari game [Bellemare et al. 2015] with the Deep Q-network (DQN) [Mnih et al. 2015] through the integration of deep learning to RL, leading to deep reinforcement learning (DRL). However, despite the recent improvements of DRL approaches, they turn out to be most of the time unsuccessful when the rewards are scattered in the environment, as the agent is then unable to learn the desired behavior for the targeted task [Francois-Lavet et al. 2018]. Moreover, the behaviors learned by the agent are hardly reusable, both within the same task and across many different tasks [Francois-Lavet et al. 2018]. It is difficult for an agent to generalize the learnt skills to make high-level decisions in the environment. For example, such skill could be go to the door using primitive actions consisting in moving in the four cardinal directions; or even to move forward controlling different joints of a humanoid robot like in the robotic simulator MuJoCo [Todorov et al. 2012]. On another side, unlike RL, developmental learning [Cangelosi and Schlesinger 2018; Oudeyer and Smith 2016; Piaget and Cook 1952] is based on the trend that babies, or more broadly organisms, acquire new skill while spontaneously exploring their environment [Barto 2013; Gopnik et al. 1999].
Semi-Supervised Imitation Learning of Team Policies from Suboptimal Demonstrations
Seo, Sangwon, Unhelkar, Vaibhav V.
We present Bayesian Team Imitation Learner (BTIL), an imitation learning algorithm to model the behavior of teams performing sequential tasks in Markovian domains. In contrast to existing multi-agent imitation learning techniques, BTIL explicitly models and infers the time-varying mental states of team members, thereby enabling learning of decentralized team policies from demonstrations of suboptimal teamwork. Further, to allow for sample- and label-efficient policy learning from small datasets, BTIL employs a Bayesian perspective and is capable of learning from semi-supervised demonstrations. We demonstrate and benchmark the performance of BTIL on synthetic multi-agent tasks as well as a novel dataset of human-agent teamwork. Our experiments show that BTIL can successfully learn team policies from demonstrations despite the influence of team members' (time-varying and potentially misaligned) mental states on their behavior.
Age of Semantics in Cooperative Communications: To Expedite Simulation Towards Real via Offline Reinforcement Learning
Chen, Xianfu, Zhao, Zhifeng, Mao, Shiwen, Wu, Celimuge, Zhang, Honggang, Bennis, Mehdi
The age of information metric fails to correctly describe the intrinsic semantics of a status update. In an intelligent reflecting surface-aided cooperative relay communication system, we propose the age of semantics (AoS) for measuring semantics freshness of the status updates. Specifically, we focus on the status updating from a source node (SN) to the destination, which is formulated as a Markov decision process (MDP). The objective of the SN is to maximize the expected satisfaction of AoS and energy consumption under the maximum transmit power constraint. To seek the optimal control policy, we first derive an online deep actor-critic (DAC) learning scheme under the on-policy temporal difference learning framework. However, implementing the online DAC in practice poses the key challenge in infinitely repeated interactions between the SN and the system, which can be dangerous particularly during the exploration. We then put forward a novel offline DAC scheme, which estimates the optimal control policy from a previously collected dataset without any further interactions with the system. Numerical experiments verify the theoretical results and show that our offline DAC scheme significantly outperforms the online DAC scheme and the most representative baselines in terms of mean utility, demonstrating strong robustness to dataset quality.
C-Causal Blindness An experimental computational framework on the isomorphic relationship between biological computation, artificial computation, and logic using weighted hidden Markov models
This text is concerned with a hypothetical flavour of cognitive blindness referred to in this paper as \textit{C-Causal Blindness} or C-CB. A cognitive blindness where the policy to obtain the objective leads to the state to be avoided. A literal example of C-CB would be \textit{Kurt G\"odel's} decision to starve for \textit{"fear of being poisoned"} - take this to be premise \textbf{A}. The objective being \textit{"to avoid being poisoned (so as to not die)"}: \textbf{C}, the plan or policy being \textit{"don't eat"}: \textbf{B}, and the actual outcome having been \textit{"dying"}: $\lnot$\textbf{C} - the state that G\"odel wanted to avoid to begin with. Like many, G\"odel pursued a strategy that caused the result he wanted to avoid. An experimental computational framework is proposed to show the isomorphic relationship between C-CB in brain computations, logic, and computer computations using hidden Markov models.