Learning Graphical Models
Multi-Agent Reinforcement Learning for Task Offloading in Wireless Edge Networks
Fox, Andrea, De Pellegrini, Francesco, Altman, Eitan
In edge computing systems, autonomous agents must make fast local decisions while competing for shared resources. Existing MARL methods often resort to centralized critics or frequent communication, which fail under limited observability and communication constraints. We propose a decentralized framework in which each agent solves a constrained Markov decision process (CMDP), coordinating implicitly through a shared constraint vector. In the specific case of offloading, for example, the constraints prevent overloading of shared server resources. Coordination constraints are updated infrequently and act as a lightweight coordination mechanism. They enable agents to align with global resource usage objectives while requiring little direct communication. Using safe reinforcement learning, agents learn policies that meet both local and global goals. We establish theoretical guarantees under mild assumptions and validate our approach experimentally, showing improved performance over centralized and independent baselines, especially in large-scale settings.
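As a rough illustration of how a constraint can steer a local decision, the sketch below runs a primal-dual (Lagrangian) update, a common device in safe reinforcement learning, on a one-variable toy problem. It is not taken from the paper: the reward 2p - p^2, the cost p, the budget 0.3, the step sizes, and the name `primal_dual` are all assumptions chosen for illustration, and the abstract does not say which safe RL algorithm the authors use.

```python
def primal_dual(budget, steps=2000, lr_primal=0.05, lr_dual=0.05):
    """Toy primal-dual ascent for: maximize 2p - p^2 subject to p <= budget.

    p is a hypothetical offloading probability in [0, 1]; the cost of
    offloading is assumed to equal p (load placed on the shared server).
    """
    p, lam = 0.5, 0.0  # initial decision and Lagrange multiplier (assumed values)
    for _ in range(steps):
        # Gradient of the Lagrangian (2p - p^2) - lam * (p - budget) w.r.t. p.
        grad_p = (2.0 - 2.0 * p) - lam
        # Primal ascent step, projected back onto [0, 1].
        p = min(1.0, max(0.0, p + lr_primal * grad_p))
        # Dual ascent step: the multiplier grows while the constraint is violated.
        lam = max(0.0, lam + lr_dual * (p - budget))
    return p, lam


p_star, lam_star = primal_dual(budget=0.3)
print(f"offloading probability ~ {p_star:.3f}, multiplier ~ {lam_star:.3f}")
```

In this toy problem the iterates settle where the constraint is tight (p close to the budget) and the multiplier equals the marginal reward at that point. In a multi-agent setting of the kind the abstract describes, the budget would be one entry of the shared constraint vector, but how that vector is set and updated is not specified here.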
Embodied Agents Meet Personalization: Investigating Challenges and Solutions Through the Lens of Memory Utilization
Kwon, Taeyoon, Choi, Dongwook, Kim, Hyojun, Kim, Sunghwan, Moon, Seungjun, Kwak, Beong-woo, Huang, Kuan-Hao, Yeo, Jinyoung
LLM-powered embodied agents have shown success on conventional object-rearrangement tasks, but providing personalized assistance that leverages user-specific knowledge from past interactions presents new challenges. We investigate these challenges through the lens of agents' memory utilization along two critical dimensions: object semantics (identifying objects based on personal meaning) and user patterns (recalling sequences from behavioral routines). To assess these capabilities, we construct MEMENTO, an end-to-end two-stage evaluation framework comprising single-memory and joint-memory tasks. Our experiments reveal that current agents can recall simple object semantics but struggle to apply sequential user patterns to planning. Through in-depth analysis, we identify two critical bottlenecks: information overload and coordination failures when handling multiple memories. Based on these findings, we explore memory architectural approaches to address these challenges. Given our observation that episodic memory provides both personalized knowledge and in-context learning benefits, we design a hierarchical knowledge graph-based user-profile memory module that separately manages personalized knowledge, achieving substantial improvements on both single and joint-memory tasks. Project website: https://connoriginal.github.io/MEMENTO
On Multiple Robustness of Proximal Dynamic Treatment Regimes
Gao, Yuanshan, Bai, Yang, Cui, Yifan
Dynamic treatment regimes are sequential decision rules that adapt treatment according to individual time-varying characteristics and outcomes to achieve optimal effects, with applications in precision medicine, personalized recommendations, and dynamic marketing. Estimating optimal dynamic treatment regimes via sequential randomized trials can face cost and ethical hurdles, often necessitating the use of historical observational data. In this work, we use the proximal causal inference framework to learn optimal dynamic treatment regimes when the unconfoundedness assumption fails. Our contributions are four-fold: (i) we propose three nonparametric identification methods for optimal dynamic treatment regimes; (ii) we establish the semiparametric efficiency bound for the value function of a given regime; (iii) we propose a (K+1)-robust method for learning optimal dynamic treatment regimes, where K is the number of stages; (iv) as a by-product for marginal structural models, we establish identification and estimation of counterfactual means under a static regime. Numerical experiments validate the efficiency and multiple robustness of our proposed methods.
Benchmarking Probabilistic Time Series Forecasting Models on Neural Activity
Lu, Ziyu, Li, Anna J., Ladd, Alexander E., Matveev, Pascha, Deole, Aditya, Shea-Brown, Eric, Kutz, J. Nathan, Steinmetz, Nicholas A.
Neural activity forecasting is central to understanding neural systems and enabling closed-loop control. While deep learning has recently advanced the state-of-the-art in the time series forecasting literature, its application to neural activity forecasting remains limited. To bridge this gap, we systematically evaluated eight probabilistic deep learning models, including two foundation models, that have demonstrated strong performance on general forecasting benchmarks. We compared them against four classical statistical models and two baseline methods on spontaneous neural activity recorded from mouse cortex via widefield imaging. Across prediction horizons, several deep learning models consistently outperformed classical approaches, with the best model producing informative forecasts up to 1.5 seconds into the future. Our findings point toward future control applications and open new avenues for probing the intrinsic temporal structure of neural activity.
On the hardness of RL with Lookahead
Pla, Corentin, Richard, Hugo, Abeille, Marc, Merlis, Nadav, Perchet, Vianney
We study reinforcement learning (RL) with transition look-ahead, where the agent may observe which states would be visited upon playing any sequence of $\ell$ actions before deciding its course of action. While such predictive information can drastically improve the achievable performance, we show that using this information optimally comes at a potentially prohibitive computational cost. Specifically, we prove that optimal planning with one-step look-ahead ($\ell=1$) can be solved in polynomial time through a novel linear programming formulation. In contrast, for $\ell \geq 2$, the problem becomes NP-hard. Our results delineate a precise boundary between tractable and intractable cases for the problem of planning with transition look-ahead in reinforcement learning.
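To make the one-step look-ahead setting concrete, the sketch below compares a standard agent with an agent that sees the realized next state of every action before choosing. It illustrates the setting only and is not the paper's linear programming formulation: it assumes a finite horizon, independent transitions across actions, and brute-force enumeration of joint outcomes (exponential in the number of actions). The function name `lookahead_values` and the toy MDP are hypothetical.

```python
from itertools import product


def lookahead_values(P, R, horizon):
    """Finite-horizon values with and without one-step transition look-ahead.

    P[s][a] is a list of next-state probabilities and R[s][a] a reward.
    Transitions of different actions are assumed independent.
    """
    n_states = len(P)
    v_look = [0.0] * n_states
    v_std = [0.0] * n_states
    for _ in range(horizon):
        new_look, new_std = [], []
        for s in range(n_states):
            n_actions = len(P[s])
            # Standard agent: commit to an action, then the next state is drawn.
            new_std.append(max(
                R[s][a] + sum(p * v_std[t] for t, p in enumerate(P[s][a]))
                for a in range(n_actions)
            ))
            # Look-ahead agent: observe the next state of every action, then pick.
            total = 0.0
            for outcome in product(range(n_states), repeat=n_actions):
                prob = 1.0
                for a, t in enumerate(outcome):
                    prob *= P[s][a][t]
                if prob > 0.0:
                    best = max(R[s][a] + v_look[t] for a, t in enumerate(outcome))
                    total += prob * best
            new_look.append(total)
        v_look, v_std = new_look, new_std
    return v_look, v_std


# Toy MDP (assumed): two states, two actions, every action is a fair coin flip
# over the next state; state 1 pays reward 1, state 0 pays nothing.
P = [[[0.5, 0.5], [0.5, 0.5]], [[0.5, 0.5], [0.5, 0.5]]]
R = [[0.0, 0.0], [1.0, 1.0]]
v_look, v_std = lookahead_values(P, R, horizon=2)
print(v_look, v_std)
```

In this toy case the look-ahead agent does strictly better because it can pick whichever action happens to lead to the rewarding state, which is the sense in which predictive information improves achievable performance.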
QPPG: Quantum-Preconditioned Policy Gradient for Link Adaptation in Rayleigh Fading Channels
Giwa, Oluwaseyi, Mohsin, Muhammad Ahmed, Adesola, Folarin Jubril, Jamshed, Muhammad Ali
Wireless communication over fading channels remains one of the fundamental challenges in modern networks. In particular, Rayleigh fading channels, which model rich-scattering non-line-of-sight environments, cause rapid and unpredictable fluctuations in signal strength that can significantly degrade throughput and reliability. To mitigate these effects, link adaptation techniques such as adaptive modulation and coding (AMC) and power control have been extensively studied as key enablers of efficient spectrum use [1], [2]. Early works on link adaptation for Rayleigh fading channels demonstrated how explicit channel estimation and threshold-based switching could improve throughput and maintain robustness under fading conditions [3]-[6]. Despite their success, these classical approaches rely on accurate channel estimation, fixed rules, and often compromise between average throughput and outage probability in a suboptimal manner [4]-[6]. Furthermore, as networks evolve toward 6G with denser topologies and stringent reliability demands, such schemes struggle to scale or adapt to system-level complexities [7], [8]. Recent works have explored deep reinforcement learning (DRL) and meta reinforcement learning (RL) for link adaptation and resource allocation, showing promising adaptability but still facing high sample complexity and training instability [9]-[12]. In this letter, we propose quantum-preconditioned policy gradient (QPPG), a natural actor-critic method for link adaptation. Oluwaseyi Giwa is with the African Institute for Mathematical Sciences, South Africa (e-mail: {oluwaseyi}@aims.ac.za). Muhammad Ahmed Mohsin is with Stanford University, Stanford, California, 94305, United States (e-mail: {muahmed}@stanford.edu).
ARM-FM: Automated Reward Machines via Foundation Models for Compositional Reinforcement Learning
Castanyer, Roger Creus, Mohamed, Faisal, Castro, Pablo Samuel, Neary, Cyrus, Berseth, Glen
Reinforcement learning (RL) algorithms are highly sensitive to reward function specification, which remains a central challenge limiting their broad applicability. We present ARM-FM: Automated Reward Machines via Foundation Models, a framework for automated, compositional reward design in RL that leverages the high-level reasoning capabilities of foundation models (FMs). Reward machines (RMs) -- an automata-based formalism for reward specification -- are used as the mechanism for RL objective specification, and are automatically constructed via the use of FMs. The structured formalism of RMs yields effective task decompositions, while the use of FMs enables objective specifications in natural language. Concretely, we (i) use FMs to automatically generate RMs from natural language specifications; (ii) associate language embeddings with each RM automata-state to enable generalization across tasks; and (iii) provide empirical evidence of ARM-FM's effectiveness in a diverse suite of challenging environments, including evidence of zero-shot generalization.
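For readers unfamiliar with reward machines, the sketch below shows the basic data structure: a finite automaton whose transitions are triggered by high-level event labels and emit rewards. It is a hand-written minimal example, not ARM-FM's generated output; the class name `RewardMachine`, the state names, the labels, and the "key then door" task are all hypothetical, and the language embeddings attached to automaton states in the abstract are omitted.

```python
from dataclasses import dataclass, field


@dataclass
class RewardMachine:
    """Finite automaton over event labels whose transitions emit rewards."""
    initial: str
    terminal: set
    # Maps (state, label) to (next_state, reward).
    delta: dict = field(default_factory=dict)

    def step(self, state, label):
        # Labels with no listed transition leave the state unchanged, reward 0.
        return self.delta.get((state, label), (state, 0.0))

    def run(self, labels):
        """Feed a sequence of labels; return the final state and total reward."""
        state, total = self.initial, 0.0
        for label in labels:
            if state in self.terminal:
                break
            state, reward = self.step(state, label)
            total += reward
        return state, total


# Hypothetical task: "pick up the key, then open the door".
rm = RewardMachine(
    initial="u0",
    terminal={"u2"},
    delta={
        ("u0", "key"): ("u1", 0.0),   # sub-goal reached, no reward yet
        ("u1", "door"): ("u2", 1.0),  # task complete
    },
)

final_state, total_reward = rm.run(["door", "key", "door"])
print(final_state, total_reward)
```

The first "door" event is ignored because the key has not been collected yet, which is how the automaton encodes the ordering of sub-tasks. An RL agent would typically condition its policy on the current automaton state alongside the environment observation.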
Test-time Prompt Intervention
Yang, Chenxu, Si, Qingyi, Dai, Mz, Yao, Dingyu, Zheng, Mingyu, Chen, Minghui, Lin, Zheng, Wang, Weiping
Test-time compute has led to remarkable success in the large language model (LLM) community, particularly for complex tasks, where longer chains of thought (CoTs) are generated to enhance reasoning capabilities. However, growing evidence reveals that such reasoning models often produce CoTs plagued by excessive redundancy, including unnecessary verification steps and repetitive reasoning shifts. The root cause lies in their post-training, which relies too heavily on outcome reward paradigms, because data for process reward paradigms, which regulate intermediate reasoning steps, is difficult to construct at scale. To address this, we propose PI, a novel framework for Test-time Prompt Intervention. PI provides an interface to dynamically guide and regulate reasoning paths during inference through timely (When module) and proper (How module) interventions and post-intervention sampling (Which module). This allows human problem-solving expertise and cognitive science principles to be seamlessly integrated into LLMs' reasoning processes, enhancing controllability and interpretability. Extensive experiments across multiple models and datasets demonstrate that PI significantly shortens CoTs while reducing hallucination, yielding more concise and reliable reasoning.
Neural Variational Dropout Processes
Jeon, Insu, Park, Youngjin, Kim, Gunhee
Learning to infer the conditional posterior model is a key step for robust meta-learning. This paper presents a new Bayesian meta-learning approach called Neural Variational Dropout Processes (NVDPs). NVDPs model the conditional posterior distribution based on a task-specific dropout; a low-rank product of Bernoulli experts meta-model is utilized for a memory-efficient mapping of dropout rates from a few observed contexts. It allows for a quick reconfiguration of a globally learned and shared neural network for new tasks in multi-task few-shot learning. In addition, NVDPs utilize a novel prior conditioned on the whole task data to optimize the conditional dropout posterior in the amortized variational inference. Surprisingly, this enables the robust approximation of task-specific dropout rates that can deal with a wide range of functional ambiguities and uncertainties. We compared the proposed method with other meta-learning approaches on few-shot learning tasks such as 1D stochastic regression, image inpainting, and classification. The results show the excellent performance of NVDPs.
Risk Assessment of an Autonomous Underwater Snake Robot in Confined Operations
The growing interest in ocean discovery imposes a need for inspection and intervention in confined and demanding environments. Eely's slender shape, in addition to its ability to change its body configurations, makes articulated underwater robots an adequate option for such environments. However, operation of Eely in such environments imposes demanding requirements on the system, as it must deal with uncertain and unstructured environments, extreme environmental conditions, and reduced navigational capabilities. This paper proposes a Bayesian approach to assess the risks of losing Eely during two mission scenarios. The goal of this work is to improve Eely's performance and the likelihood of mission success. Sensitivity analysis results are presented to identify the causes with the highest impact on the risk of losing Eely.