


Collaborative Decision Making Using Action Suggestions

Neural Information Processing Systems

(Extraction residue: the excerpt defines the suggestion distribution $p(o^s_t \mid s_t)$ via an indicator function and introduces a parameter in $(0, 1]$, then summarizes results tables reporting mean reward against message reception rate and against the chance of random suggestions for the Normal, Perfect, Random, Naive, Scaled, and Noisy agents.)


Appendix

Neural Information Processing Systems

According to Alg. 2, at least one leaf node is expanded in each exploration. Moreover, the overall size of the belief tree is $O((|A| \min(P_\delta^{\max}, N_{\max}))^D)$, where $N_{\max}$ is the maximum sample size given by KLD-Sampling, $P_\delta^{\max} = \sup_{b,a} P_\delta(Y_{b,a})$, and $Y_{b,a}$ is the set of reachable beliefs after executing action $a$ at belief $b$. The tree size is limited since $N_{\max}$ is finite. The weights are normalized. There exist bounded functions $\alpha$ and $\alpha'$ such that $V(b) = \int \alpha(s)\,b(s)\,ds$ and $V(b') = \int \alpha'(s)\,b'(s)\,ds$. We can bound the first and third terms, respectively, by $\lambda$ in light of the assumptions.



Learning to Trust: Bayesian Adaptation to Varying Suggester Reliability in Sequential Decision Making

Asmar, Dylan M., Kochenderfer, Mykel J.

arXiv.org Artificial Intelligence

Autonomous agents operating in sequential decision-making tasks under uncertainty can benefit from external action suggestions, which provide valuable guidance but inherently vary in reliability. Existing methods for incorporating such advice typically assume static and known suggester quality parameters, limiting practical deployment. We introduce a framework that dynamically learns and adapts to varying suggester reliability in partially observable environments. First, we integrate suggester quality directly into the agent's belief representation, enabling agents to infer and adjust their reliance on suggestions through Bayesian inference over suggester types. Second, we introduce an explicit "ask" action allowing agents to strategically request suggestions at critical moments, balancing informational gains against acquisition costs. Experimental evaluation demonstrates robust performance across varying suggester qualities, adaptation to changing reliability, and strategic management of suggestion requests. This work provides a foundation for adaptive human-agent collaboration by addressing suggestion uncertainty in uncertain environments.
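The first contribution above, Bayesian inference over suggester types, can be sketched as a discrete posterior update. Everything concrete here (the type set, the accuracy numbers, and the "suggestion matched the optimal action" outcome signal) is invented for illustration and is not taken from the paper:

```python
import numpy as np

# Hypothetical suggester types and assumed probabilities of suggesting
# the optimal action; the values are illustrative, not from the paper.
TYPES = ["expert", "mediocre", "random"]
ACCURACY = np.array([0.95, 0.6, 0.25])
N_ACTIONS = 4  # assumed size of the action space

def update_type_belief(type_belief, suggestion_was_optimal):
    """One Bayesian update over suggester types after observing whether a
    suggestion matched the (later-revealed) optimal action."""
    if suggestion_was_optimal:
        likelihood = ACCURACY
    else:
        # a wrong suggestion is assumed uniform over the other actions
        likelihood = (1.0 - ACCURACY) / (N_ACTIONS - 1)
    posterior = type_belief * likelihood
    return posterior / posterior.sum()

belief = np.ones(len(TYPES)) / len(TYPES)  # start uniform over types
for outcome in [True, True, True]:          # three good suggestions observed
    belief = update_type_belief(belief, outcome)
best = TYPES[int(np.argmax(belief))]        # belief concentrates on "expert"
```

After a few consistently good suggestions the posterior mass shifts toward the reliable type, which is the mechanism that lets the agent increase its reliance on that suggester.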


GammaZero: Learning To Guide POMDP Belief Space Search With Graph Representations

Mangannavar, Rajesh, Tadepalli, Prasad

arXiv.org Artificial Intelligence

We introduce an action-centric graph representation framework for learning to guide planning in Partially Observable Markov Decision Processes (POMDPs). Unlike existing approaches that require domain-specific neural architectures and struggle with scalability, GammaZero leverages a unified graph-based belief representation that enables generalization across problem sizes within a domain. Our key insight is that belief states can be systematically transformed into action-centric graphs where structural patterns learned on small problems transfer to larger instances. We employ a graph neural network with a decoder architecture to learn value functions and policies from expert demonstrations on computationally tractable problems, then apply these learned heuristics to guide Monte Carlo tree search on larger problems. Experimental results on standard POMDP benchmarks demonstrate that GammaZero achieves comparable performance to BetaZero when trained and tested on the same-sized problems, while uniquely enabling zero-shot generalization to problems 2-4 times larger than those seen during training, maintaining solution quality with reduced search requirements. Partially observable Markov decision processes (POMDPs) provide a principled framework for sequential decision-making under uncertainty, where agents must act based on incomplete information about the true state of the environment (Kaelbling et al., 1998). This partial observability arises naturally in many real-world applications, from autonomous driving where sensors provide a limited field of view (Hoel et al., 2019), to robotic manipulation where object properties must be inferred through interaction (Lauri et al., 2022), to subsurface exploration where underground structures can only be observed at sparse drilling locations (Mern & Caers, 2023).


Appendix A Different Quality Suggester Results

Neural Information Processing Systems

This section presents results on RockSample (8, 4, 10, 1) when the suggester is not always all-knowing. In our approach, we formulated the belief update under the assumption that the suggester observed the environment. These results demonstrate that our approach extends beyond an all-knowing suggester and can incorporate information from suggestions developed from different beliefs of the state. Table 3 contains the mean rewards and Table 4 contains the mean number of suggestions considered by the agent. The details of the agents are provided in Section 4.2.
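A belief update of the kind described above can be sketched as reweighting a particle belief by how likely each candidate state makes the received suggestion. This is a minimal sketch, not the paper's exact update: the noise model, the `suggester_policy` helper, and the `accuracy` and `n_actions` parameters are all assumptions of this example:

```python
import numpy as np

def update_belief_with_suggestion(belief, particles, suggested_action,
                                  suggester_policy, accuracy=0.9, n_actions=4):
    """Reweight a particle belief after receiving an action suggestion.

    Assumed suggester model: in state s it recommends suggester_policy(s)
    with probability `accuracy`, and a uniformly random other action
    otherwise. The posterior weight of each particle is proportional to
    prior weight times the likelihood of the observed suggestion.
    """
    likelihood = np.array([
        accuracy if suggester_policy(s) == suggested_action
        else (1.0 - accuracy) / (n_actions - 1)
        for s in particles
    ])
    posterior = np.asarray(belief, dtype=float) * likelihood
    return posterior / posterior.sum()
```

Particles whose states would have led the suggester to recommend the observed action gain weight, so the suggestion acts as an extra observation about the hidden state.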



Learning Symbolic Persistent Macro-Actions for POMDP Solving Over Time

Veronese, Celeste, Meli, Daniele, Farinelli, Alessandro

arXiv.org Artificial Intelligence

Most popular and effective approaches to solving Partially Observable Markov Decision Processes (POMDPs, Kaelbling et al. (1998)) online, e.g., Partially Observable Monte Carlo Planning (POMCP) by Silver and Veness (2010) and Determinized Sparse Partially Observable Tree (DESPOT) by Ye et al. (2017), rely on Monte Carlo Tree Search (MCTS). These approaches are based on online simulations performed in a simulation environment (i.e., a black-box twin of the real POMDP environment) and estimate the value of actions. However, for efficient exploration they require domain-specific policy heuristics that suggest the best actions at each state. Macro-actions (He et al. (2011); Bertolucci et al. (2021)) are popular policy heuristics that are particularly efficient for long planning horizons. A macro-action is essentially a sequence of suggested actions from a given state that can effectively guide the simulation phase towards actions with high utilities. However, such heuristics are heavily dependent on domain features and are typically handcrafted for each specific domain. Defining these heuristics is an arduous process that requires significant domain knowledge, especially in complex domains. An alternative approach, like the one by Cai and Hsu (2022), is to learn such heuristics via neural networks, which are, however, uninterpretable and data-inefficient. This paper extends the methodology proposed by Meli et al. (2024) to the learning, via Inductive Logic Programming (ILP, Muggleton (1991)), of Event Calculus (EC) theories.
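How a macro-action guides the simulation phase can be sketched as a rollout policy that follows the suggested action sequence before falling back to random actions. The simulator interface `step`, the action names, and the discount are assumptions of this example, not the paper's implementation:

```python
import random

def macro_guided_rollout(state, macro_action, step, depth=20, gamma=0.95,
                         actions=("up", "down", "left", "right")):
    """Estimate a state's value by one simulated trajectory that follows
    the macro-action prefix, then acts uniformly at random.

    `step(state, action)` stands in for the black-box simulator and
    returns (next_state, reward); the discounted return of the rollout is
    the value estimate used to guide search.
    """
    total, discount = 0.0, 1.0
    plan = list(macro_action)
    for _ in range(depth):
        action = plan.pop(0) if plan else random.choice(actions)
        state, reward = step(state, action)
        total += discount * reward
        discount *= gamma
    return total
```

A good macro-action steers the early, high-discount steps of the rollout toward rewarding regions, which is why such heuristics help most on long planning horizons.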


Mitigating Partial Observability in Sequential Decision Processes via the Lambda Discrepancy

Allen, Cameron, Kirtland, Aaron, Tao, Ruo Yu, Lobel, Sam, Scott, Daniel, Petrocelli, Nicholas, Gottesman, Omer, Parr, Ronald, Littman, Michael L., Konidaris, George

arXiv.org Machine Learning

Reinforcement learning algorithms typically rely on the assumption that the environment dynamics and value function can be expressed in terms of a Markovian state representation. However, when state information is only partially observable, how can an agent learn such a state representation, and how can it detect when it has found one? We introduce a metric that can accomplish both objectives, without requiring access to--or knowledge of--an underlying, unobservable state space. Our metric, the $\lambda$-discrepancy, is the difference between two distinct temporal difference (TD) value estimates, each computed using TD($\lambda$) with a different value of $\lambda$. Since TD($\lambda$=0) makes an implicit Markov assumption and TD($\lambda$=1) does not, a discrepancy between these estimates is a potential indicator of a non-Markovian state representation. Indeed, we prove that the $\lambda$-discrepancy is exactly zero for all Markov decision processes and almost always non-zero for a broad class of partially observable environments. We also demonstrate empirically that, once detected, minimizing the $\lambda$-discrepancy can help with learning a memory function to mitigate the corresponding partial observability. We then train a reinforcement learning agent that simultaneously constructs two recurrent value networks with different $\lambda$ parameters and minimizes the difference between them as an auxiliary loss. The approach scales to challenging partially observable domains, where the resulting agent frequently performs significantly better (and never performs worse) than a baseline recurrent agent with only a single value network.
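The $\lambda$-discrepancy can be illustrated on a toy aliased environment: a two-state corridor $s_1 \to s_2 \to$ goal where both states emit the same observation, so any observation-conditioned value function must give them a single value. The environment and numbers below are invented for illustration; the point is only that the TD($\lambda$=1) (Monte Carlo) and TD($\lambda$=0) fixed points disagree under aliasing:

```python
gamma = 0.9  # discount; reward 1 is received on reaching the goal from s2

# TD(lambda=1) == Monte Carlo: average the true returns observed at each
# visit to the aliased observation (one visit from s1, one from s2 per
# episode, so each contributes with weight 1/2).
mc_value = ((gamma * 1.0) + 1.0) / 2

# TD(lambda=0): the single value V bootstraps through itself.
# Targets: from s1 -> 0 + gamma * V ; from s2 -> 1 + gamma * 0 (terminal).
# Fixed point of V = 0.5 * (gamma * V) + 0.5 * 1:
td0_value = 0.5 / (1 - 0.5 * gamma)

# Nonzero discrepancy flags the non-Markovian observation; for a true MDP
# state representation the two estimates would coincide.
lambda_discrepancy = abs(mc_value - td0_value)
```

Here the Monte Carlo value is 0.95 while the TD(0) fixed point is about 0.909, so the discrepancy is about 0.041; minimizing it (e.g., by adding memory) is what the auxiliary loss in the abstract targets.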