Markov Models
Robust Phi-Divergence MDPs
Ho, Chin Pang, Petrik, Marek, Wiesemann, Wolfram
In recent years, robust Markov decision processes (MDPs) have emerged as a prominent modeling framework for dynamic decision problems affected by uncertainty. In contrast to classical MDPs, which only account for stochasticity by modeling the dynamics through a stochastic process with a known transition kernel, robust MDPs additionally account for ambiguity by optimizing in view of the most adverse transition kernel from a prescribed ambiguity set. In this paper, we develop a novel solution framework for robust MDPs with s-rectangular ambiguity sets that decomposes the problem into a sequence of robust Bellman updates and simplex projections. Exploiting the rich structure present in the simplex projections corresponding to phi-divergence ambiguity sets, we show that the associated s-rectangular robust MDPs can be solved substantially faster than with state-of-the-art commercial solvers as well as a recent first-order solution scheme, thus rendering them attractive alternatives to classical MDPs in practical applications.
Multitask Weakly Supervised Learning for Origin Destination Travel Time Estimation
Wang, Hongjun, Zhang, Zhiwen, Fan, Zipei, Chen, Jiyuan, Zhang, Lingyu, Shibasaki, Ryosuke, Song, Xuan
Travel time estimation from GPS trips is of great importance to order duration, ridesharing, taxi dispatching, etc. However, the dense trajectory is not always available due to the limitation of data privacy and acquisition, while the origin destination (OD) type of data, such as NYC taxi data, NYC bike data, and Capital Bikeshare data, is more accessible. To address this issue, this paper starts to estimate the OD trips travel time combined with the road network. Subsequently, a Multitask Weakly Supervised Learning Framework for Travel Time Estimation (MWSL TTE) has been proposed to infer transition probability between roads segments, and the travel time on road segments and intersection simultaneously. Technically, given an OD pair, the transition probability intends to recover the most possible route. And then, the output of travel time is equal to the summation of all segments' and intersections' travel time in this route. A novel route recovery function has been proposed to iteratively maximize the current route's co occurrence probability, and minimize the discrepancy between routes' probability distribution and the inverse distribution of routes' estimation loss. Moreover, the expected log likelihood function based on a weakly supervised framework has been deployed in optimizing the travel time from road segments and intersections concurrently. We conduct experiments on a wide range of real world taxi datasets in Xi'an and Chengdu and demonstrate our method's effectiveness on route recovery and travel time estimation.
Safe Policy Improvement for POMDPs via Finite-State Controllers
Simรฃo, Thiago D., Suilen, Marnix, Jansen, Nils
We study safe policy improvement (SPI) for partially observable Markov decision processes (POMDPs). SPI is an offline reinforcement learning (RL) problem that assumes access to (1) historical data about an environment, and (2) the so-called behavior policy that previously generated this data by interacting with the environment. SPI methods neither require access to a model nor the environment itself, and aim to reliably improve the behavior policy in an offline manner. Existing methods make the strong assumption that the environment is fully observable. In our novel approach to the SPI problem for POMDPs, we assume that a finite-state controller (FSC) represents the behavior policy and that finite memory is sufficient to derive optimal policies. This assumption allows us to map the POMDP to a finite-state fully observable MDP, the history MDP. We estimate this MDP by combining the historical data and the memory of the FSC, and compute an improved policy using an off-the-shelf SPI algorithm. The underlying SPI method constrains the policy-space according to the available data, such that the newly computed policy only differs from the behavior policy when sufficient data was available. We show that this new policy, converted into a new FSC for the (unknown) POMDP, outperforms the behavior policy with high probability. Experimental results on several well-established benchmarks show the applicability of the approach, even in cases where finite memory is not sufficient.
TransfQMix: Transformers for Leveraging the Graph Structure of Multi-Agent Reinforcement Learning Problems
Gallici, Matteo, Martin, Mario, Masmitja, Ivan
Coordination is one of the most difficult aspects of multi-agent reinforcement learning (MARL). One reason is that agents normally choose their actions independently of one another. In order to see coordination strategies emerging from the combination of independent policies, the recent research has focused on the use of a centralized function (CF) that learns each agent's contribution to the team reward. However, the structure in which the environment is presented to the agents and to the CF is typically overlooked. We have observed that the features used to describe the coordination problem can be represented as vertex features of a latent graph structure. Here, we present TransfQMix, a new approach that uses transformers to leverage this latent structure and learn better coordination policies. Our transformer agents perform a graph reasoning over the state of the observable entities. Our transformer Q-mixer learns a monotonic mixing-function from a larger graph that includes the internal and external states of the agents. TransfQMix is designed to be entirely transferable, meaning that same parameters can be used to control and train larger or smaller teams of agents. This enables to deploy promising approaches to save training time and derive general policies in MARL, such as transfer learning, zero-shot transfer, and curriculum learning. We report TransfQMix's performances in the Spread and StarCraft II environments. In both settings, it outperforms state-of-the-art Q-Learning models, and it demonstrates effectiveness in solving problems that other methods can not solve.
Approximate Information States for Worst-Case Control and Learning in Uncertain Systems
Dave, Aditya, Venkatesh, Nishanth, Malikopoulos, Andreas A.
In this paper, we investigate discrete-time decision-making problems in uncertain systems with partially observed states. We consider a non-stochastic model, where uncontrolled disturbances acting on the system take values in bounded sets with unknown distributions. We present a general framework for decision-making in such problems by developing the notions of information states and approximate information states. In our definition of an information state, we introduce conditions to identify for an uncertain variable sufficient to construct a dynamic program (DP) that computes an optimal strategy. We show that many information states from the literature on worst-case control actions, e.g., the conditional range, are examples of our more general definition. Next, we relax these conditions to define approximate information states using only output variables, which can be learned from output data without knowledge of system dynamics. We use this notion to formulate an approximate DP that yields a strategy with a bounded performance loss. Finally, we illustrate the application of our results in control and reinforcement learning using numerical examples.
An Approach to Stochastic Dynamic Games with Asymmetric Information and Hidden Actions
Ouyang, Yi, Tavafoghi, Hamidreza, Teneketzis, Demosthenis
We study, in discrete time, a general class of sequential stochastic dynamic games with asymmetric information. We consider a setting where the underlying system has Markovian dynamics controlled by the agents' joint actions. Each agent's instantaneous utility depends on the agents' joint actions and the system state. At each time instant each agent makes a private noisy observation that depends on the current system state and the agents' actions in the previous time instant. In addition, at each time instant all agents may have a common noisy observation of the system state and their actions in the previous time instant. The agents' actions are hidden, that is, each agent's actions are not directly observable by the other agents. Therefore, at every time instant agents have asymmetric and imperfect information about the game's history. Dynamic games with the above features arise in engineering (cybersecurity, transportation, energy markets), in economics (industrial organization), and in socio-technological applications. As pointed out in Tang et al (2022), the key challenges in the study of dynamic games with asymmetric information are: (i) The domain of agents' strategies increases with time, as the agents acquire information over time.
Graph based Environment Representation for Vision-and-Language Navigation in Continuous Environments
Wang, Ting, Wu, Zongkai, Yao, Feiyu, Wang, Donglin
Vision-and-Language Navigation in Continuous Environments (VLN-CE) is a navigation task that requires an agent to follow a language instruction in a realistic environment. The understanding of environments is a crucial part of the VLN-CE task, but existing methods are relatively simple and direct in understanding the environment, without delving into the relationship between language instructions and visual environments. Therefore, we propose a new environment representation in order to solve the above problems. First, we propose an Environment Representation Graph (ERG) through object detection to express the environment in semantic level. This operation enhances the relationship between language and environment. Then, the relational representations of object-object, object-agent in ERG are learned through GCN, so as to obtain a continuous expression about ERG. Sequentially, we combine the ERG expression with object label embeddings to obtain the environment representation. Finally, a new cross-modal attention navigation framework is proposed, incorporating our environment representation and a special loss function dedicated to training ERG. Experimental result shows that our method achieves satisfactory performance in terms of success rate on VLN-CE tasks. Further analysis explains that our method attains better cross-modal matching and strong generalization ability.
Interpretable Hidden Markov Model-Based Deep Reinforcement Learning Hierarchical Framework for Predictive Maintenance of Turbofan Engines
Abbas, Ammar N., Chasparis, Georgios, Kelleher, John D.
An open research question in deep reinforcement learning is how to focus the policy learning of key decisions within a sparse domain. This paper emphasizes combining the advantages of inputoutput hidden Markov models and reinforcement learning towards interpretable maintenance decisions. We propose a novel hierarchical-modeling methodology that, at a high level, detects and interprets the root cause of a failure as well as the health degradation of the turbofan engine, while, at a low level, it provides the optimal replacement policy. It outperforms the baseline performance of deep reinforcement learning methods applied directly to the raw data or when using a hidden Markov model without such a specialized hierarchy. It also provides comparable performance to prior work, however, with the additional benefit of interpretability.
Unifying Consciousness and Time to Enhance Artificial Intelligence
Consciousness is a sequential process of awareness which can focus on one piece of information at a time. This process of awareness experiences causation which underpins the notion of time while it interplays with matter and energy, forming reality. The study of Consciousness, time and reality is complex and evolving fast in many fields, including metaphysics and fundamental physics. Reality composes patterns in human Consciousness in response to the regularities in nature. These regularities could be physical (e.g., astronomical, environmental), biological, chemical, mental, social, etc. The patterns that emerged in Consciousness were correlated to the environment, life and social behaviours followed by constructed frameworks, systems and structures. The complex constructs evolved as cultures, customs, norms and values, which created a diverse society. In the evolution of responsible AI, it is important to be attuned to the evolved cultural, ethical and moral values through Consciousness. This requires the advocated design of self-learning AI aware of time perception and human ethics.
AI-Based Affective Music Generation Systems: A Review of Methods, and Challenges
Music is a powerful medium for altering the emotional state of the listener. In recent years, with significant advancement in computing capabilities, artificial intelligence-based (AI-based) approaches have become popular for creating affective music generation (AMG) systems that are empowered with the ability to generate affective music. Entertainment, healthcare, and sensor-integrated interactive system design are a few of the areas in which AI-based affective music generation (AI-AMG) systems may have a significant impact. Given the surge of interest in this topic, this article aims to provide a comprehensive review of AI-AMG systems. The main building blocks of an AI-AMG system are discussed, and existing systems are formally categorized based on the core algorithm used for music generation. In addition, this article discusses the main musical features employed to compose affective music, along with the respective AI-based approaches used for tailoring them. Lastly, the main challenges and open questions in this field, as well as their potential solutions, are presented to guide future research. We hope that this review will be useful for readers seeking to understand the state-of-the-art in AI-AMG systems, and gain an overview of the methods used for developing them, thereby helping them explore this field in the future.