Luo, Baiting
Scalable Decision-Making in Stochastic Environments through Learned Temporal Abstraction
Luo, Baiting, Pettet, Ava, Laszka, Aron, Dubey, Abhishek, Mukhopadhyay, Ayan
If we were to apply MCTS directly to this abstracted space, we would encounter two main issues: inefficient utilization of our pre-built search space, with the search potentially diverging prematurely into unexplored regions, and difficulty in building sufficiently deep trees for high-quality long-term decision-making, particularly in areas of high stochasticity or uncertainty (Couëtoux et al., 2011). Therefore, we use progressive widening to extend MCTS so that it incrementally expands the search tree. Progressive widening balances the exploration of new states with the exploitation of already visited states based on two hyperparameters: α ∈ [0, 1] and ϵ ∈ ℝ+. Let |C(s, z)| denote the number of children of the state-action pair (s, z). The key idea is to alternate between adding new child nodes and selecting among existing child nodes, depending on the number of times the state-action pair (s, z) has been visited. A new state is added to the tree if |C(s, z)| < ϵ N(s, z)^α, where N(s, z) is the number of times the state-action pair has been visited. The hyperparameter α controls the propensity to select among existing children, with α = 0 leading to always selecting among existing children and α = 1 recovering vanilla MCTS behavior (always adding a new child). In this way, we enhance our approach by efficiently utilizing the pre-built search space, prioritizing the exploration of promising macro actions while allowing for incremental expansion of the search tree. This technique enables our method to make quick decisions in an anytime manner by leveraging cached information, and to further refine the planning tree if additional time is available.
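To make the widening rule concrete, the sketch below shows how the condition |C(s, z)| < ϵ N(s, z)^α can gate expansion versus reuse at a tree node. This is a minimal illustration rather than the paper's implementation; the node class, the simulator callable, and the default hyperparameter values are assumptions.

```python
import random
from collections import defaultdict

class PWNode:
    """Per-state tree node tracking sampled successors for each macro action z."""
    def __init__(self):
        self.visits = defaultdict(int)     # N(s, z): visit count per macro action
        self.children = defaultdict(list)  # C(s, z): sampled successor states

def expand_or_select(node, state, z, simulator, alpha=0.5, eps=1.0, rng=random):
    """Progressive widening: add a new successor only while
    |C(s, z)| < eps * N(s, z)**alpha; otherwise reuse an existing successor."""
    node.visits[z] += 1
    n = node.visits[z]
    if len(node.children[z]) < eps * n ** alpha:
        next_state = simulator(state, z)   # sample a fresh outcome from the generative model
        node.children[z].append(next_state)
        return next_state
    return rng.choice(node.children[z])    # exploit an already visited outcome
```

With α = 0 the bound stays constant at ϵ, so the node keeps choosing among existing children; with α = 1 the bound grows linearly in N(s, z), so a new child is added on nearly every visit, recovering vanilla MCTS expansion.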
NS-Gym: Open-Source Simulation Environments and Benchmarks for Non-Stationary Markov Decision Processes
Keplinger, Nathaniel S., Luo, Baiting, Bektas, Iliyas, Zhang, Yunuo, Wray, Kyle Hollins, Laszka, Aron, Dubey, Abhishek, Mukhopadhyay, Ayan
In many real-world applications, agents must make sequential decisions in environments where conditions are subject to change due to various exogenous factors. These non-stationary environments pose significant challenges to traditional decision-making models, which typically assume stationary dynamics. Non-stationary Markov decision processes (NS-MDPs) offer a framework to model and solve decision problems under such changing conditions. However, the lack of standardized benchmarks and simulation tools has hindered systematic evaluation and advancement in this field. We present NS-Gym, the first simulation toolkit designed explicitly for NS-MDPs, integrated within the popular Gymnasium framework. In NS-Gym, we segregate the evolution of the environmental parameters that characterize non-stationarity from the agent's decision-making module, allowing for modular and flexible adaptations to dynamic environments. We review prior work in this domain and present a toolkit encapsulating key problem characteristics and types in NS-MDPs. This toolkit is the first effort to develop a set of standardized interfaces and benchmark problems to enable consistent and reproducible evaluation of algorithms under non-stationary conditions. We also benchmark six algorithmic approaches from prior work on NS-MDPs using NS-Gym. Our vision is that NS-Gym will enable researchers to assess the adaptability and robustness of their decision-making algorithms to non-stationary conditions.
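The abstract describes keeping the evolution of non-stationary parameters separate from the agent's decision-making module; the sketch below shows one way that separation can look on top of Gymnasium. It is an illustration, not the actual NS-Gym interface: the wrapper name, the param_name/schedule arguments, and the drifting-gravity CartPole example are assumptions.

```python
import gymnasium as gym

class NonStationaryWrapper(gym.Wrapper):
    """Evolve an exogenous environment parameter over time, keeping the
    parameter schedule outside the agent's decision-making module."""
    def __init__(self, env, param_name, schedule):
        super().__init__(env)
        self.param_name = param_name  # attribute of the underlying env, e.g. "gravity"
        self.schedule = schedule      # callable: timestep -> parameter value
        self.t = 0

    def step(self, action):
        # Update the exogenous parameter before the agent's action takes effect.
        setattr(self.env.unwrapped, self.param_name, self.schedule(self.t))
        self.t += 1
        return self.env.step(action)

# Example: CartPole whose gravity drifts upward as the episode progresses.
env = NonStationaryWrapper(gym.make("CartPole-v1"),
                           param_name="gravity",
                           schedule=lambda t: 9.8 + 0.01 * t)
```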
Shrinking POMCP: A Framework for Real-Time UAV Search and Rescue
Zhang, Yunuo, Luo, Baiting, Mukhopadhyay, Ayan, Stojcsics, Daniel, Elenius, Daniel, Roy, Anirban, Jha, Susmit, Maroti, Miklos, Koutsoukos, Xenofon, Karsai, Gabor, Dubey, Abhishek
Efficient path optimization for drones in search and rescue operations faces challenges, including limited visibility, time constraints, and complex information gathering in urban environments. We present a comprehensive approach to optimize UAV-based search and rescue operations in neighborhood areas, utilizing both a 3D AirSim-ROS2 simulator and a 2D simulator. The path planning problem is formulated as a partially observable Markov decision process (POMDP), and we propose a novel "Shrinking POMCP" approach to address time constraints. In the AirSim environment, we integrate our approach with a probabilistic world model for belief maintenance and a neurosymbolic navigator for obstacle avoidance. The 2D simulator employs surrogate ROS2 nodes with equivalent functionality. We compare trajectories generated by different approaches in the 2D simulator and evaluate performance across various belief types in the 3D AirSim-ROS2 simulator. Experimental results from both simulators demonstrate that our proposed Shrinking POMCP solution achieves significant improvements in search times compared to alternative methods, showcasing its potential for enhancing the efficiency of UAV-assisted search and rescue operations.

Search and rescue (SAR) operations are critical, time-sensitive missions conducted in challenging environments like neighborhoods, wilderness [1], or maritime settings [2]. These resource-intensive operations require efficient path planning and optimal routing [3]. In recent years, Unmanned Aerial Vehicles (UAVs) have become valuable SAR assets, offering advantages such as rapid deployment, extended flight times, and access to hard-to-reach areas. Equipped with sensors and cameras, UAVs can detect heat signatures, identify objects, and provide real-time aerial imagery to search teams [4]. However, the use of UAVs in SAR operations presents unique challenges, particularly in path planning and decision-making under uncertainty. Factors such as limited battery life, changing weather conditions, and incomplete information about the search area complicate the task of efficiently coordinating UAV movements to maximize the probability of locating targets [3].
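The abstract mentions a probabilistic world model used for belief maintenance but does not spell out its internals. As a generic illustration of POMDP belief maintenance, and not the paper's specific world model, the sketch below updates a particle belief after an action and observation; the transition and obs_likelihood callables are assumptions.

```python
import random

def update_belief(particles, action, observation, transition, obs_likelihood, rng=random):
    """Generic particle-filter belief update for a POMDP: propagate each particle
    through the transition model, weight by the observation likelihood, resample."""
    propagated = [transition(s, action) for s in particles]
    weights = [obs_likelihood(observation, s) for s in propagated]
    if sum(weights) == 0.0:
        return propagated  # no particle explains the observation; keep the propagated set
    return rng.choices(propagated, weights=weights, k=len(particles))
```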
Act as You Learn: Adaptive Decision-Making in Non-Stationary Markov Decision Processes
Luo, Baiting, Zhang, Yunuo, Dubey, Abhishek, Mukhopadhyay, Ayan
A fundamental (and largely open) challenge in sequential decision-making is dealing with non-stationary environments, where exogenous environmental conditions change over time. Such problems are traditionally modeled as non-stationary Markov decision processes (NSMDP). However, existing approaches for decision-making in NSMDPs have two major shortcomings: first, they assume that the updated environmental dynamics at the current time are known (although future dynamics can change); and second, planning is largely pessimistic, i.e., the agent acts ``safely'' to account for the non-stationary evolution of the environment. We argue that both these assumptions are invalid in practice -- updated environmental conditions are rarely known, and as the agent interacts with the environment, it can learn about the updated dynamics and avoid being pessimistic, at least in states whose dynamics it is confident about. We present a heuristic search algorithm called \textit{Adaptive Monte Carlo Tree Search (ADA-MCTS)} that addresses these challenges. We show that the agent can learn the updated dynamics of the environment over time and then act as it learns, i.e., if the agent is in a region of the state space about which it has updated knowledge, it can avoid being pessimistic. To quantify ``updated knowledge,'' we disentangle the aleatoric and epistemic uncertainty in the agent's updated belief and show how the agent can use these estimates for decision-making. We compare the proposed approach with multiple state-of-the-art approaches in decision-making across multiple well-established open-source problems and empirically show that our approach is faster and highly adaptive without sacrificing safety.
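One standard way to separate the two kinds of uncertainty (shown here as a generic recipe, not necessarily the estimator used in ADA-MCTS) is the mutual-information decomposition over an ensemble of learned transition models: the mean predictive entropy captures aleatoric noise, while ensemble disagreement captures epistemic uncertainty. The ensemble_probs layout below is an assumption.

```python
import numpy as np

def uncertainty_estimates(ensemble_probs):
    """Split predictive uncertainty for a discrete next-state distribution.

    ensemble_probs: array of shape (M, S) -- M ensemble members, each a
    probability distribution over S successor states.
    Returns (aleatoric, epistemic) estimates.
    """
    probs = np.asarray(ensemble_probs, dtype=float)
    mean = probs.mean(axis=0)

    def entropy(p):
        p = np.clip(p, 1e-12, 1.0)
        return float(-(p * np.log(p)).sum())

    total = entropy(mean)                             # entropy of the ensemble mean
    aleatoric = np.mean([entropy(p) for p in probs])  # average per-member entropy
    epistemic = total - aleatoric                     # disagreement (mutual information)
    return aleatoric, epistemic

# Example: three models that agree closely -> low epistemic uncertainty.
print(uncertainty_estimates([[0.7, 0.3], [0.72, 0.28], [0.68, 0.32]]))
```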
Decision Making in Non-Stationary Environments with Policy-Augmented Search
Pettet, Ava, Zhang, Yunuo, Luo, Baiting, Wray, Kyle, Baier, Hendrik, Laszka, Aron, Dubey, Abhishek, Mukhopadhyay, Ayan
Sequential decision-making under uncertainty is present in many important problems. Two popular approaches for tackling such problems are reinforcement learning and online search (e.g., Monte Carlo tree search). While the former learns a policy by interacting with the environment (typically done before execution), the latter uses a generative model of the environment to sample promising action trajectories at decision time. Decision-making is particularly challenging in non-stationary environments, where the environment in which an agent operates can change over time. Both approaches have shortcomings in such settings -- on the one hand, policies learned before execution become stale when the environment changes and relearning takes both time and computational effort. Online search, on the other hand, can return sub-optimal actions when there are limitations on allowed runtime. In this paper, we introduce \textit{Policy-Augmented Monte Carlo tree search} (PA-MCTS), which combines action-value estimates from an out-of-date policy with an online search using an up-to-date model of the environment. We prove theoretical results showing conditions under which PA-MCTS selects the one-step optimal action and also bound the error accrued while following PA-MCTS as a policy. We compare and contrast our approach with AlphaZero, another hybrid planning approach, and Deep Q Learning on several OpenAI Gym environments. Through extensive experiments, we show that under non-stationary settings with limited time constraints, PA-MCTS outperforms these baselines.
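The abstract states that PA-MCTS combines action-value estimates from an out-of-date policy with an online search under the up-to-date model, but does not give the combination rule here. Below is a minimal sketch of one natural instantiation, a convex combination controlled by a weight lam; the names q_policy, mcts_value, and lam are assumptions.

```python
def pa_action_value(state, action, q_policy, mcts_value, lam=0.5):
    """Blend a stale policy's action-value estimate with the return estimated
    by online search under the up-to-date model.
    lam = 1.0 trusts only the stale policy; lam = 0.0 trusts only the search."""
    return lam * q_policy(state, action) + (1.0 - lam) * mcts_value(state, action)

def select_action(state, actions, q_policy, mcts_value, lam=0.5):
    """Pick the action maximizing the blended value."""
    return max(actions, key=lambda a: pa_action_value(state, a, q_policy, mcts_value, lam))
```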
Dynamic Simplex: Balancing Safety and Performance in Autonomous Cyber Physical Systems
Luo, Baiting, Ramakrishna, Shreyas, Pettet, Ava, Kuhn, Christopher, Karsai, Gabor, Mukhopadhyay, Ayan
Learning Enabled Components (LECs) have greatly assisted cyber-physical systems in achieving higher levels of autonomy. However, the susceptibility of LECs to dynamic and uncertain operating conditions is a critical challenge for the safety of these systems. Redundant controller architectures have been widely adopted for safety assurance in such contexts. These architectures augment "performant" LEC controllers, which are difficult to verify, with "safety" controllers and decision logic to switch between them. While these architectures ensure safety, we point out two limitations. First, they are trained offline to learn a conservative policy of always selecting a controller that maintains the system's safety, which limits the system's adaptability to dynamic and non-stationary environments. Second, they do not support reverse switching from the safety controller to the performant controller, even when the threat to safety is no longer present. To address these limitations, we propose a dynamic simplex strategy with an online controller switching logic that allows two-way switching. We consider switching as a sequential decision-making problem and model it as a semi-Markov decision process. We leverage a combination of a myopic selector using surrogate models (for the forward switch) and a non-myopic planner (for the reverse switch) to balance safety and performance. We evaluate this approach using an autonomous vehicle case study in the CARLA simulator under different driving conditions, locations, and component failures. We show that the proposed approach results in fewer collisions and higher performance than state-of-the-art alternatives.
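As a rough sketch of the two-way switching logic described above (not the paper's actual decision procedure): risk_estimate stands in for the surrogate-model-based myopic selector that triggers the forward switch, while reverse_switch_value stands in for the non-myopic planner's assessment of switching back; both callables and the threshold are assumptions.

```python
from enum import Enum

class Controller(Enum):
    PERFORMANT = "performant"
    SAFETY = "safety"

def switch(active, state, risk_estimate, reverse_switch_value, risk_threshold=0.8):
    """Two-way switching sketch for a simplex-style architecture.

    Forward switch: hand control to the safety controller when the myopic
    surrogate risk estimate exceeds a threshold.
    Reverse switch: return control to the performant controller only when the
    non-myopic planner judges the expected value of switching back positive.
    """
    if active is Controller.PERFORMANT and risk_estimate(state) > risk_threshold:
        return Controller.SAFETY
    if active is Controller.SAFETY and reverse_switch_value(state) > 0.0:
        return Controller.PERFORMANT
    return active
```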