Goto

Collaborating Authors

 Learning Graphical Models


Common $p$-Belief with Plausibility Measures: Extended Abstract

arXiv.org Artificial Intelligence

Aumann's famous Agreeing to Disagree Theorem states that if a group of agents share a common prior, update their beliefs by Bayesian conditioning based on private information, and have common knowledge of their posterior beliefs regarding some event, these posteriors must be identical. There is an elegant generalization of this theorem by Monderer and Samet, later refined by Neeman: if a group of agents share a common prior, update their beliefs using Bayesian conditioning on private information, and have common p-belief of their posteriors, these posteriors must be close (i.e., they cannot differ by more than 1 - p). Here, common p-belief generalizes the concept of common knowledge to probabilistic beliefs: agents commonly p-believe an event E if everyone believes E to at least degree p, everyone believes to at least degree p that everyone believes E to at least degree p, and so on. This paper further extends the Monderer-Samet-Neeman Agreement Theorem from classical probability measures to plausibility measures -- a very general framework introduced by Halpern that unifies many formal models of belief. To facilitate this extension, we provide a new proof of the Monderer-Samet-Neeman theorem in the classical setting. Building upon both the original proof and our new proof, we offer two different generalizations of the theorem to plausibility-based structures. We then apply these generalized results to several non-classical belief models, including conditional probability structures and lexicographic probability structures. Moreover, we show that whenever our generalized theorems do not apply, the Monderer-Samet-Neeman Agreement Theorem fails. These findings suggest that our results successfully identify the minimal conditions required for a belief model to satisfy the Monderer-Samet-Neeman Agreement Theorem.


Embedded Universal Predictive Intelligence: a coherent framework for multi-agent learning

arXiv.org Artificial Intelligence

The standard theory of model-free reinforcement learning assumes that the environment dynamics are stationary and that agents are decoupled from their environment, such that policies are treated as being separate from the world they inhabit. This leads to theoretical challenges in the multi-agent setting where the non-stationarity induced by the learning of other agents demands prospective learning based on prediction models. To accurately model other agents, an agent must account for the fact that those other agents are, in turn, forming beliefs about it to predict its future behavior, motivating agents to model themselves as part of the environment. Here, building upon foundational work on universal artificial intelligence (AIXI), we introduce a mathematical framework for prospective learning and embedded agency centered on self-prediction, where Bayesian RL agents predict both future perceptual inputs and their own actions, and must therefore resolve epistemic uncertainty about themselves as part of the universe they inhabit. We show that in multi-agent settings, self-prediction enables agents to reason about others running similar algorithms, leading to new game-theoretic solution concepts and novel forms of cooperation unattainable by classical decoupled agents. Moreover, we extend the theory of AIXI, and study universally intelligent embedded agents which start from a Solomonoff prior. We show that these idealized agents can form consistent mutual predictions and achieve infinite-order theory of mind, potentially setting a gold standard for embedded multi-agent learning.


Bayesian Decentralized Decision-making for Multi-Robot Systems: Sample-efficient Estimation of Event Rates

arXiv.org Artificial Intelligence

Abstract-- Effective collective decision-making in swarm robotics often requires balancing exploration, communication and individual uncertainty estimation, especially in hazardous environments where direct measurements are limited or costly. We propose a decentralized Bayesian framework that enables a swarm of simple robots to identify the safer of two areas, each characterized by an unknown rate of hazardous events governed by a Poisson process. Robots employ a conjugate prior to gradually predict the times between events and derive confidence estimates to adapt their behavior . Our simulation results show that the robot swarm consistently chooses the correct area while reducing exposure to hazardous events by being sample-efficient. Compared to baseline heuristics, our proposed approach shows better performance in terms of safety and speed of convergence. The proposed scenario has potential to extend the current set of benchmarks in collective decision-making and our method has applications in adaptive risk-aware sampling and exploration in hazardous, dynamic environments. Collective decision-making under uncertainty is a fundamental challenge in multi-robot systems, including domains such as collective perception, environment classification, and spatial consensus [1]-[4]. Decentralized systems (e.g., robot swarms) operate under strict limitations on sensing, communication, and memory. Instead of sharing/storing complete observation histories, robots must maintain compact model representations of their knowledge. It is crucial to develop efficient strategies for collective decision-making, especially when observations are sparse, noisy [5], and gathered from stochastic processes [6]. This is typically characterized as a best-of-n problem [3], [7].


BiCQL-ML: A Bi-Level Conservative Q-Learning Framework for Maximum Likelihood Inverse Reinforcement Learning

arXiv.org Artificial Intelligence

Offline inverse reinforcement learning (IRL) aims to recover a reward function that explains expert behavior using only fixed demonstration data, without any additional online interaction. We propose BiCQL-ML, a policy-free offline IRL algorithm that jointly optimizes a reward function and a conservative Q-function in a bi-level framework, thereby avoiding explicit policy learning. The method alternates between (i) learning a conservative Q-function via Conservative Q-Learning (CQL) under the current reward, and (ii) updating the reward parameters to maximize the expected Q-values of expert actions while suppressing over-generalization to out-of-distribution actions. This procedure can be viewed as maximum likelihood estimation under a soft value matching principle. We provide theoretical guarantees that BiCQL-ML converges to a reward function under which the expert policy is soft-optimal. Empirically, we show on standard offline RL benchmarks that BiCQL-ML improves both reward recovery and downstream policy performance compared to existing offline IRL baselines.


Benchmarking In-context Experiential Learning Through Repeated Product Recommendations

arXiv.org Artificial Intelligence

To reliably navigate ever-shifting real-world environments, agents must grapple with incomplete knowledge and adapt their behavior through experience. However, current evaluations largely focus on tasks that leave no ambiguity, and do not measure agents' ability to adaptively learn and reason through the experiences they accrued. We exemplify the need for this in-context experiential learning in a product recommendation context, where agents must navigate shifting customer preferences and product landscapes through natural language dialogue. We curate a benchmark for experiential learning and active exploration (BELA) that combines (1) rich real-world products from Amazon, (2) a diverse collection of user personas to represent heterogeneous yet latent preferences, and (3) a LLM user simulator powered by the persona to create rich interactive trajectories. We observe that current frontier models struggle to meaningfully improve across episodes, underscoring the need for agentic systems with strong in-context learning capabilities.


Joint Estimation of Sea State and Vessel Parameters Using a Mass-Spring-Damper Equivalence Model

arXiv.org Artificial Intelligence

Real-time sea state estimation is vital for applications like shipbuilding and maritime safety. Traditional methods rely on accurate wave-vessel transfer functions to estimate wave spectra from onboard sensors. In contrast, our approach jointly estimates sea state and vessel parameters without needing prior transfer function knowledge, which may be unavailable or variable. We model the wave-vessel system using pseudo mass-spring-dampers and develop a dynamic model for the system. This method allows for recursive modeling of wave excitation as a time-varying input, relaxing prior works' assumption of a constant input. We derive statistically consistent process noise covariance and implement a square root cubature Kalman filter for sensor data fusion. Further, we derive the Posterior Cramer-Rao lower bound to evaluate estimator performance. Extensive Monte Carlo simulations and data from a high-fidelity validated simulator confirm that the estimated wave spectrum matches methods assuming complete transfer function knowledge.


Actionable and diverse counterfactual explanations incorporating domain knowledge and causal constraints

arXiv.org Artificial Intelligence

Counterfactual explanations enhance the actionable interpretability of machine learning models by identifying the minimal changes required to achieve a desired outcome of the model. However, existing methods often ignore the complex dependencies in real-world datasets, leading to unrealistic or impractical modifications. Motivated by cybersecurity applications in the email marketing domain, we propose a method for generating Diverse, Actionable, and kNowledge-Constrained Explanations (DANCE), which incorporates feature dependencies and causal constraints to ensure plausibility and real-world feasibility of counterfactuals. Our method learns linear and nonlinear constraints from data or integrates expert-provided dependency graphs, ensuring counterfactuals are plausible and actionable. By maintaining consistency with feature relationships, the method produces explanations that align with real-world constraints. Additionally, it balances plausibility, diversity, and sparsity, effectively addressing key limitations in existing algorithms. The work is developed based on a real-life case study with Freshmail, the largest email marketing company in Poland and supported by a joint R&D project Sendguard. Furthermore, we provide an extensive evaluation using 140 public datasets, which highlights its ability to generate meaningful, domain-relevant counterfactuals that outperform other existing approaches based on widely used metrics. The source code for reproduction of the results can be found in a GitHub repository we provide.


Rethinking Progression of Memory State in Robotic Manipulation: An Object-Centric Perspective

arXiv.org Artificial Intelligence

As embodied agents operate in increasingly complex environments, the ability to perceive, track, and reason about individual object instances over time becomes essential, especially in tasks requiring sequenced interactions with visually similar objects. In these non-Markovian settings, key decision cues are often hidden in object-specific histories rather than the current scene. Without persistent memory of prior interactions (what has been interacted with, where it has been, or how it has changed) visuomotor policies may fail, repeat past actions, or overlook completed ones. To surface this challenge, we introduce LIBERO-Mem, a non-Markovian task suite for stress-testing robotic manipulation under object-level partial observability. It combines short- and long-horizon object tracking with temporally sequenced subgoals, requiring reasoning beyond the current frame. However, vision-language-action (VLA) models often struggle in such settings, with token scaling quickly becoming intractable even for tasks spanning just a few hundred frames. We propose Embodied-SlotSSM, a slot-centric VLA framework built for temporal scalability. It maintains spatio-temporally consistent slot identities and leverages them through two mechanisms: (1) slot-state-space modeling for reconstructing short-term history, and (2) a relational encoder to align the input tokens with action decoding. Together, these components enable temporally grounded, context-aware action prediction. Experiments show Embodied-SlotSSM's baseline performance on LIBERO-Mem and general tasks, offering a scalable solution for non-Markovian reasoning in object-centric robotic policies.


Vectorized Online POMDP Planning

arXiv.org Artificial Intelligence

-- Planning under partial observability is an essential capability of autonomous robots. The Partially Observable Markov Decision Process (POMDP) provides a powerful framework for planning under partial observability problems, capturing the stochastic effects of actions and the limited information available through noisy observations. POMDP solving could benefit tremendously from massive parallelization on today's hardware, but parallelizing POMDP solvers has been challenging. Most of these solvers rely on interleaving numerical optimization over actions with the estimation of their values, which creates dependencies and synchronization bottlenecks between parallel processes that can offset the benefits of paral-lelization. In this paper, we propose V ectorized Online POMDP Planner (VOPP), a novel parallel online solver that leverages a recent POMDP formulation which analytically solves part of the optimization component, leaving numerical computations to consist of only estimation of expectations. VOPP represents all data structures related to planning as a collection of tensors, and implements all planning steps as fully vectorized computations over this representation. The result is a massively parallel solver with no dependencies or synchronization bottlenecks between parallel processes. Experimental results indicate that VOPP is at least 20 more efficient in computing near-optimal solutions compared to an existing state-of-the-art parallel online solver .


Memo: Training Memory-Efficient Embodied Agents with Reinforcement Learning

arXiv.org Artificial Intelligence

To enable embodied agents to operate effectively over extended timeframes, it is crucial to develop models that form and access memories to stay contextualized in their environment. In the current paradigm of training transformer-based policies for embodied sequential decision-making tasks, visual inputs often overwhelm the context limits of transformers, while humans can maintain and utilize a lifetime of experience compressed as memories. Significant compression is possible in principle, as much of the input is irrelevant and can be abstracted. However, existing approaches predominantly focus on either recurrent models with fixed-size memory or transformers with full-context reliance. In this work, we propose Memo, a transformer-based architecture and training recipe for reinforcement learning (RL) on memory-intensive, long-horizon tasks. Memo incorporates the creation and retrieval of memory by interleaving periodic summarization tokens with the inputs of a model during training. We demonstrate Memo's effectiveness on a gridworld meta-RL benchmark and a multi-object navigation task in photo-realistic indoor settings. Memo outperforms naive long-context transformer baselines while being more compute and storage efficient. Additionally, Memo generalizes better to longer contexts at inference time and remains robust in streaming settings, where historical context must be truncated to fit inference constraints. Our code is available at: https://github.com/gunshi/memo.