Markov Models
Towards Measuring Goal-Directedness in AI Systems
Recent advances in deep learning have brought attention to the possibility of creating advanced, general AI systems that outperform humans across many tasks. However, if these systems pursue unintended goals, there could be catastrophic consequences. A key prerequisite for AI systems pursuing unintended goals is whether they will behave in a coherent and goal-directed manner in the first place, optimizing for some unknown goal; there exists significant research trying to evaluate systems for said behaviors. However, the most rigorous definitions of goal-directedness we currently have are difficult to compute in real-world settings. Drawing upon this previous literature, we explore policy goal-directedness within reinforcement learning (RL) environments. In our findings, we propose a different family of definitions of the goal-directedness of a policy that analyze whether it is well-modeled as near-optimal for many (sparse) reward functions. We operationalize this preliminary definition of goal-directedness and test it in toy Markov decision process (MDP) environments. Furthermore, we explore how goal-directedness could be measured in frontier large-language models (LLMs). Our contribution is a definition of goal-directedness that is simpler and more easily computable in order to approach the question of whether AI systems could pursue dangerous goals. We recommend further exploration of measuring coherence and goal-directedness, based on our findings.
Explainable Finite-Memory Policies for Partially Observable Markov Decision Processes
Azeem, Muqsit, Chakraborty, Debraj, Kanav, Sudeep, Kretinsky, Jan
Partially Observable Markov Decision Processes (POMDPs) are a fundamental framework for decision-making under uncertainty and partial observability. Since in general optimal policies may require infinite memory, they are hard to implement and often render most problems undecidable. Consequently, finite-memory policies are mostly considered instead. However, the algorithms for computing them are typically very complex, and so are the resulting policies. Facing the need for their explainability, we provide a representation of such policies, both (i) in an interpretable formalism and (ii) typically of smaller size, together yielding higher explainability. To that end, we combine models of Mealy machines and decision trees; the latter describing simple, stationary parts of the policies and the former describing how to switch among them. We design a translation for policies of the finite-state-controller (FSC) form from standard literature and show how our method smoothly generalizes to other variants of finite-memory policies. Further, we identify specific properties of recently used "attractor-based" policies, which allow us to construct yet simpler and smaller representations. Finally, we illustrate the higher explainability in a few case studies.
Multi-agent reinforcement learning strategy to maximize the lifetime of Wireless Rechargeable
The thesis proposes a generalized charging framework for multiple mobile chargers to maximize the network lifetime and ensure target coverage and connectivity in large scale WRSNs. Moreover, a multi-point charging model is leveraged to enhance charging efficiency, where the MC can charge multiple sensors simultaneously at each charging location. The thesis proposes an effective Decentralized Partially Observable Semi-Markov Decision Process (Dec POSMDP) model that promotes Mobile Chargers (MCs) cooperation and detects optimal charging locations based on realtime network information. Furthermore, the proposal allows reinforcement algorithms to be applied to different networks without requiring extensive retraining. To solve the Dec POSMDP model, the thesis proposes an Asynchronous Multi Agent Reinforcement Learning algorithm (AMAPPO) based on the Proximal Policy Optimization algorithm (PPO).
Long-term Detection System for Six Kinds of Abnormal Behavior of the Elderly Living Alone
Tanaka, Kai, Kudo, Mineichi, Kimura, Keigo, Nakamura, Atsuyoshi
The proportion of elderly people is increasing worldwide, particularly those living alone in Japan. As elderly people get older, their risks of physical disabilities and health issues increase. To automatically discover these issues at a low cost in daily life, sensor-based detection in a smart home is promising. As part of the effort towards early detection of abnormal behaviors, we propose a simulator-based detection systems for six typical anomalies: being semi-bedridden, being housebound, forgetting, wandering, fall while walking and fall while standing. Our detection system can be customized for various room layout, sensor arrangement and resident's characteristics by training detection classifiers using the simulator with the parameters fitted to individual cases. Considering that the six anomalies that our system detects have various occurrence durations, such as being housebound for weeks or lying still for seconds after a fall, the detection classifiers of our system produce anomaly labels depending on each anomaly's occurrence duration, e.g., housebound per day and falls per second. We propose a method that standardizes the processing of sensor data, and uses a simple detection approach. Although the validity depends on the realism of the simulation, numerical evaluations using sensor data that includes a variety of resident behavior patterns over nine years as test data show that (1) the methods for detecting wandering and falls are comparable to previous methods, and (2) the methods for detecting being semi-bedridden, being housebound, and forgetting achieve a sensitivity of over 0.9 with fewer than one false alarm every 50 days.
Provably Efficient Action-Manipulation Attack Against Continuous Reinforcement Learning
Luo, Zhi, Yang, Xiyuan, Zhou, Pan, Wang, Di
Manipulating the interaction trajectories between the intelligent agent and the environment can control the agent's training and behavior, exposing the potential vulnerabilities of reinforcement learning (RL). For example, in Cyber-Physical Systems (CPS) controlled by RL, the attacker can manipulate the actions of the adopted RL to other actions during the training phase, which will lead to bad consequences. Existing work has studied action-manipulation attacks in tabular settings, where the states and actions are discrete. As seen in many up-and-coming RL applications, such as autonomous driving, continuous action space is widely accepted, however, its action-manipulation attacks have not been thoroughly investigated yet. In this paper, we consider this crucial problem in both white-box and black-box scenarios. Specifically, utilizing the knowledge derived exclusively from trajectories, we propose a black-box attack algorithm named LCBT, which uses the Monte Carlo tree search method for efficient action searching and manipulation. Additionally, we demonstrate that for an agent whose dynamic regret is sub-linearly related to the total number of steps, LCBT can teach the agent to converge to target policies with only sublinear attack cost, i.e., $O\left(\mathcal{R}(T) + MH^3K^E\log (MT)\right)(0
AsymDex: Leveraging Asymmetry and Relative Motion in Learning Bimanual Dexterity
Yang, Zhaodong, Han, Yunhai, Ravichandar, Harish
We present Asymmetric Dexterity (AsymDex), a novel reinforcement learning (RL) framework that can efficiently learn asymmetric bimanual skills for multi-fingered hands without relying on demonstrations, which can be cumbersome to collect. Two crucial ingredients enable AsymDex to reduce the observation and action space dimensions and improve sample efficiency. First, AsymDex leverages the natural asymmetry found in human bimanual manipulation and assigns specific and interdependent roles to each hand: a facilitating hand that moves and reorients the object, and a dominant hand that performs complex manipulations on said object. Second, AsymDex defines and operates over relative observation and action spaces, facilitating responsive coordination between the two hands. Further, AsymDex can be easily integrated with recent advances in grasp learning to handle both the object acquisition phase and the interaction phase of bimanual dexterity. Unlike existing RL-based methods for bimanual dexterity, which are tailored to a specific task, AsymDex can be used to learn a wide variety of bimanual tasks that exhibit asymmetry. Detailed experiments on four simulated asymmetric bimanual dexterous manipulation tasks reveal that AsymDex consistently outperforms strong baselines that challenge its design choices, in terms of success rate and sample efficiency. The project website is at https://sites.google.com/view/asymdex-2024/.
Shrinking POMCP: A Framework for Real-Time UAV Search and Rescue
Zhang, Yunuo, Luo, Baiting, Mukhopadhyay, Ayan, Stojcsics, Daniel, Elenius, Daniel, Roy, Anirban, Jha, Susmit, Maroti, Miklos, Koutsoukos, Xenofon, Karsai, Gabor, Dubey, Abhishek
--Efficient path optimization for drones in search and rescue operations faces challenges, including limited visibility, time constraints, and complex information gathering in urban environments. We present a comprehensive approach to optimize UA V-based search and rescue operations in neighborhood areas, utilizing both a 3D AirSim-ROS2 simulator and a 2D simulator . The path planning problem is formulated as a partially observable Markov decision process (POMDP), and we propose a novel "Shrinking POMCP" approach to address time constraints. In the AirSim environment, we integrate our approach with a probabilistic world model for belief maintenance and a neu-rosymbolic navigator for obstacle avoidance. The 2D simulator employs surrogate ROS2 nodes with equivalent functionality. We compare trajectories generated by different approaches in the 2D simulator and evaluate performance across various belief types in the 3D AirSim-ROS simulator . Experimental results from both simulators demonstrate that our proposed shrinking POMCP solution achieves significant improvements in search times compared to alternative methods, showcasing its potential for enhancing the efficiency of UA V-assisted search and rescue operations. Search and rescue (SAR) operations are critical, time-sensitive missions conducted in challenging environments like neighborhoods, wilderness [1], or maritime settings [2]. These resource-intensive operations require efficient path planning and optimal routing [3]. In recent years, Unmanned Aerial V ehicles (UA Vs) have become valuable SAR assets, offering advantages such as rapid deployment, extended flight times, and access to hard-to-reach areas. Equipped with sensors and cameras, UA Vs can detect heat signatures, identify objects, and provide real-time aerial imagery to search teams [4]. However, the use of UA Vs in SAR operations presents unique challenges, particularly in path planning and decision-making under uncertainty. Factors such as limited battery life, changing weather conditions, and incomplete information about the search area complicate the task of efficiently coordinating UA V movements to maximize the probability of locating targets [3].
On Diffusion Models for Multi-Agent Partial Observability: Shared Attractors, Error Bounds, and Composite Flow
Wang, Tonghan, Dong, Heng, Jiang, Yanchen, Parkes, David C., Tambe, Milind
Multiagent systems grapple with partial observability (PO), and the decentralized POMDP (Dec-POMDP) model highlights the fundamental nature of this challenge. Whereas recent approaches to addressing PO have appealed to deep learning models, providing a rigorous understanding of how these models and their approximation errors affect agents' handling of PO and their interactions remain a challenge. In addressing this challenge, we investigate reconstructing global states from local action-observation histories in Dec-POMDPs using diffusion models. We first find that diffusion models conditioned on local history represent possible states as stable fixed points. In collectively observable (CO) Dec-POMDPs, individual diffusion models conditioned on agents' local histories share a unique fixed point corresponding to the global state, while in non-CO settings, the shared fixed points yield a distribution of possible states given joint history. We further find that, with deep learning approximation errors, fixed points can deviate from true states and the deviation is negatively correlated to the Jacobian rank. Inspired by this low-rank property, we bound the deviation by constructing a surrogate linear regression model that approximates the local behavior of diffusion models. With this bound, we propose a composite diffusion process iterating over agents with theoretical convergence guarantees to the true state.
TransDreamer: Reinforcement Learning with Transformer World Models
Chen, Chang, Wu, Yi-Fu, Yoon, Jaesik, Ahn, Sungjin
The Dreamer agent provides various benefits of Model-Based Reinforcement Learning (MBRL) such as sample efficiency, reusable knowledge, and safe planning. However, its world model and policy networks inherit the limitations of recurrent neural networks and thus an important question is how an MBRL framework can benefit from the recent advances of transformers and what the challenges are in doing so. In this paper, we propose a transformer-based MBRL agent, called TransDreamer. We first introduce the Transformer State-Space Model, a world model that leverages a transformer for dynamics predictions. We then share this world model with a transformer-based policy network and obtain stability in training a transformer-based RL agent. In experiments, we apply the proposed model to 2D visual RL and 3D first-person visual RL tasks both requiring long-range memory access for memory-based reasoning. We show that the proposed model outperforms Dreamer in these complex tasks.
Stream-Based Active Learning for Process Monitoring
Capezza, Christian, Lepore, Antonio, Paynabar, Kamran
Statistical process monitoring (SPM) methods are essential tools in quality management to check the stability of industrial processes, i.e., to dynamically classify the process state as in control (IC), under normal operating conditions, or out of control (OC), otherwise. Traditional SPM methods are based on unsupervised approaches, which are popular because in most industrial applications the true OC states of the process are not explicitly known. This hampered the development of supervised methods that could instead take advantage of process data containing labels on the true process state, although they still need improvement in dealing with class imbalance, as OC states are rare in high-quality processes, and the dynamic recognition of unseen classes, e.g., the number of possible OC states. This article presents a novel stream-based active learning strategy for SPM that enhances partially hidden Markov models to deal with data streams. The ultimate goal is to optimize labeling resources constrained by a limited budget and dynamically update the possible OC states. The proposed method performance in classifying the true state of the process is assessed through a simulation and a case study on the SPM of a resistance spot welding process in the automotive industry, which motivated this research.