Goto

Collaborating Authors

 Markov Models


Effort Allocation for Deadline-Aware Task and Motion Planning: A Metareasoning Approach

arXiv.org Artificial Intelligence

Their approach involved modeling the problem using a set of processes, each dedicated to searching for a plan, akin to representing search nodes on an open list. Each process is characterized by a probabilistic performance profile, modeled by a random variable indicating the probability of successful termination given processing time, as well as a random variable modeling the deadline corresponding to each partial plan, which is only revealed after planning is concluded. The meta-level problem lies in finding an optimal schedule of processing time across all processes that maximizes the probability that any process delivers a plan before its deadline. A simplified version of this problem, known as "simplified allocating planning effort when actions expire," assumes discrete time intervals and has been proven to be NP-hard. However, under the condition of known deadlines, the problem becomes solvable in pseudo-polynomial time through dynamic programming. Later, this line of work was extended to consider interleaved planning and execution, where partial plans can be executed during the search [62, 63]. While this body of work bears relevance to our research, it primarily concentrates on deriving symbolic plans. In contrast, our focus lies in elaborating existing symbolic plans through motion-level reasoning to make them executable for a robot, optimizing the likelihood of meeting a pre-specified deadline.


Cooperative and Asynchronous Transformer-based Mission Planning for Heterogeneous Teams of Mobile Robots

arXiv.org Artificial Intelligence

Coordinating heterogeneous teams of mobile robots for tasks such as search and rescue is highly challenging. This is due to the complexities of perception, decision making and planning in such environments, with agents' non-synchronous operation, constrained communication, and limited computational resources. This paper presents the Cooperative and Asynchronous Transformer-based Mission Planning (CATMiP) framework, which leverages multi-agent reinforcement learning (MARL) to effectively coordinate agents with heterogeneous sensing, motion, and actuation capabilities. The framework introduces a Class-based Macro-Action Decentralized Partially Observable Markov Decision Process (CMD-POMDP) model to handle asynchronous decision-making among different agent classes via macro-actions. It also extends the Multi-Agent Transformer (MAT) architecture to facilitate distributed, ad hoc communication among the agents. CATMiP easily adapts to mission complexities and communication constraints, and scales to varying environment sizes and team compositions. Simulations demonstrate its scalability and ability to achieve cooperative mission objectives with two classes of explorer and rescuer agents, even under severe communication constraints. The code is available at https://github.com/mylad13/CATMiP.


Reviews: Data-Efficient Reinforcement Learning in Continuous State-Action Gaussian-POMDPs

Neural Information Processing Systems

This paper describes an extension to the PILCO algorithm (Probabilistic Inference and Learning for COntrol, a data-efficient reinforcement algorithm). The proposed algorithm applies a measurement filtering algorithm during the actual experiment and explicitly takes this measurement filtering algorithm into account during the policy learning step, which uses data from the experiment. This is an important practical extension addressing the fact that measurements are often very noisy. My intuitive explanation for this approach is that the proposed approach makes the overall feedback system more "repeatable" (noise is mostly filtered out) and therefore learning is faster (given that the filtering is effective, see last sentence of the conclusion). The paper presents detailed mathematical derivations and strong simulation results that highlight the properties of the proposed algorithm.


Reviews: Learning Unknown Markov Decision Processes: A Thompson Sampling Approach

Neural Information Processing Systems

The paper proposes TSDE, a posterior sampling algorithm for RL in the average reward infinite horizon setting. This algorithm uses dynamic episodes but unlike Lazy-PSRL avoids technical issues by not only terminating an episode when an observation count doubled but also terminating episodes when they become too long. This ensures that the episode length cannot grow faster than linear and ultimately a Bayesian regret bound of O(HS(AT) .5) is shown. Posterior sampling methods typically outperform UCB-type algorithms and therefore a posterior sampling algorithm for non-episodic RL with rigorous regret bounds is desirable. This paper proposes such an algorithm, which is of high interest.


Reviews: Entropy Rate Estimation for Markov Chains with Large State Space

Neural Information Processing Systems

The paper proposes an entropy estimate for Markov chains by reduction to optimal entropy estimation for i.i.d samples. Sample complexity analysis is provided for different mixing scenarios with a minimax rate established for a particular rate. The estimator is used to assess the capacity of language models. This is a very clear and well-written paper. I appreciate the efforts done by the authors to summarize the results.


Reviews: Modeling Dynamic Missingness of Implicit Feedback for Recommendation

Neural Information Processing Systems

This paper presents H4MF model (HMM MF for dynamic Missingness) for implicit feedback data. With implicit data, we only observe positive feedback and the missing entries (zeros) in the data can indicate either negative feedback or users are not exposed of the items. H4MF is based on the previous work on modeling user latent exposure (ExpoMF, Liang et al., Modeling user exposure in recommendation, 2016) -- the basic idea is that for each user-item pair, there is a latent binary variable to represent exposure; if it's 1, it means this user is exposed to the item thus 0 feedback mean true negative, while if it's 0, it means this user have not yet been exposed to this item yet. The difference in H4MF is that H4MF uses a hidden Markov model to capture the temporal dynamics in the user exposure (user intent in this paper). The basic idea is that whether or not a user is exposed to something can be dependent on some other items he/she has been exposed before.


Reviews: Robust and Efficient Transfer Learning with Hidden Parameter Markov Decision Processes

Neural Information Processing Systems

Summary: This paper presents a new transfer learning approach using Bayesian Neural Network in MDPs. They are building on the existing framework of Hidden Parameter MDPs, and replace the Gaussian process with BNNs, thereby also modeling the joint uncertainty in the latent weights and the state space. Overall, this proposed approach is sound, well developed and seems to help scale the inference. The authors have also shown that it works well by applying it to multiple domains. The paper is extremely well written.


Reviews: Policy-Conditioned Uncertainty Sets for Robust Markov Decision Processes

Neural Information Processing Systems

The authors consider distributionally robust finite MDPs over a finite horizon. The transition probabilities conditionally to a state-action pair should remain at L1-bounded distance from a base measure, which is feasible as being generated using a given reference policy. This is a nice idea. A few comments are mentioned next. Related to that question, why the requirement of staying "close" to this policy would be beneficial.


Reviews: Inverse Filtering for Hidden Markov Models

Neural Information Processing Systems

The paper addresses recovery of the observation sequence given known posterior state estimates, but unknown observations and/or sensor model and also in an extension, noise-corrupted measurements. There is a nice progression of the problem through IP, LP, and MILP followed by a more careful analytical derivation of the answers in the noise-free case, and a seemingly approximate though empirically effective approach (cf. Honestly, most of the motivations seem to be unrealistic, especially the cyber-physical security setting where one does not observe posteriors, but simply an action based on a presumed argmax w.r.t. The EEG application (while somewhat narrow) seems to be the best motivation, however, the sole example is to compare resconstructed observations to a redundant method of sensing -- is this really a compelling application? Is it actually used in practice?


Reviews: Learning Others' Intentional Models in Multi-Agent Settings Using Interactive POMDPs

Neural Information Processing Systems

The paper describes a sampling method for learning agent behaviors in interactive POMDPs (I-POMDPs). In general, I-POMDPs are a multi-agent POMDP model which, in addition to a belief about the environment state, the belief space includes nested recursive beliefs about the other agents' models. I-POMDP solutions, including the one proposed in the paper, largely approximate using a finite depth with either intentional models of others (e.g., their nested beliefs, state transitions, optimality criterion, etc.) or subintentional models of others (e.g., essentially "summaries of behavior" such as fictitious play). The proposed approach uses samples of the other agent at a particular depth to compute its values and policy. Related work on an interactive particle filter assumed the full frame was known (b, S, A, Omega, T, R, OC).