Markov Models
SkillGen: Learning Domain Skills for In-Context Sequential Decision Making
Ding, Ruomeng, Cheng, Wei, Shao, Minglai, Zhao, Chen
Large language models (LLMs) are increasingly applied to sequential decision-making through in-context learning (ICL), yet their effectiveness is highly sensitive to prompt quality. Effective prompts should meet three principles: focus on decision-critical information, provide step-level granularity, and minimize reliance on expert annotations through label efficiency. However, existing ICL methods often fail to satisfy all three criteria simultaneously. Motivated by these challenges, we introduce SkillGen, a skill-based ICL framework for structured sequential reasoning. It constructs an action-centric, domain-level graph from sampled trajectories, identifies high-utility actions via temporal-difference credit assignment, and retrieves step-wise skills to generate fine-grained, context-aware prompts. We further present a theoretical analysis showing that focusing on high-utility segments supports task identifiability and informs more effective ICL prompt design. Experiments on ALFWorld, BabyAI, and ScienceWorld, using both open-source and proprietary LLMs, show that SkillGen achieves consistent gains, improving progress rate by 5.9%-16.5% on average across models.
Gallant: Voxel Grid-based Humanoid Locomotion and Local-navigation across 3D Constrained Terrains
Ben, Qingwei, Xu, Botian, Li, Kailin, Jia, Feiyu, Zhang, Wentao, Wang, Jingping, Wang, Jingbo, Lin, Dahua, Pang, Jiangmiao
Robust humanoid locomotion requires accurate and globally consistent perception of the surrounding 3D environment. However, existing perception modules, mainly based on depth images or elevation maps, offer only partial and locally flattened views of the environment, failing to capture the full 3D structure. This paper presents Gallant, a voxel-grid-based framework for humanoid locomotion and local navigation in 3D constrained terrains. It leverages voxelized LiDAR data as a lightweight and structured perceptual representation, and employs a z-grouped 2D CNN to map this representation to the control policy, enabling fully end-to-end optimization. A high-fidelity LiDAR simulation that dynamically generates realistic observations is developed to support scalable, LiDAR-based training and ensure sim-to-real consistency. Experimental results show that Gallant's broader perceptual coverage facilitates the use of a single policy that goes beyond the limitations of previous methods confined to ground-level obstacles, extending to lateral clutter, overhead constraints, multi-level structures, and narrow passages. Gallant is also the first to achieve near-100% success rates in challenging scenarios such as stair climbing and stepping onto elevated platforms, through improved end-to-end optimization.
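The voxelized representation described above can be illustrated with a minimal sketch. It assumes a binary occupancy grid and an axis-aligned box around the robot; the function name `voxelize`, its parameters, and the grid layout are hypothetical and are not taken from the Gallant paper. The grid is indexed as `[z][x][y]` so that each height slice can be read as one channel of a 2D image, which is one plausible input layout for a 2D CNN that groups along z.

```python
import math


def voxelize(points, grid_min, voxel_size, grid_shape):
    """Turn a list of (x, y, z) points into a binary occupancy grid.

    grid_shape is (nx, ny, nz). The result is indexed [z][x][y], so each
    z-slice is a 2D occupancy image (an assumed layout, for illustration).
    """
    nx, ny, nz = grid_shape
    grid = [[[0.0] * ny for _ in range(nx)] for _ in range(nz)]
    for point in points:
        # Integer voxel index of the point along each axis.
        ix, iy, iz = (
            math.floor((coord - origin) / voxel_size)
            for coord, origin in zip(point, grid_min)
        )
        # Points outside the local grid are dropped.
        if 0 <= ix < nx and 0 <= iy < ny and 0 <= iz < nz:
            grid[iz][ix][iy] = 1.0
    return grid
```

Calling `voxelize(points, (0.0, 0.0, 0.0), 0.5, (4, 4, 3))` returns three 4x4 slices; an overhead obstacle and a ground-level obstacle at the same (x, y) location land in different slices, which is the information an elevation map would flatten away.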
Expert-Guided POMDP Learning for Data-Efficient Modeling in Healthcare
Locatelli, Marco, Hommersom, Arjen, Cerioli, Roberto Clemens, Besozzi, Daniela, Stella, Fabio
Learning the parameters of Partially Observable Markov Decision Processes (POMDPs) from limited data is a significant challenge. We introduce the Fuzzy MAP EM algorithm, a novel approach that incorporates expert knowledge into the parameter estimation process by enriching the Expectation Maximization (EM) framework with fuzzy pseudo-counts derived from an expert-defined fuzzy model. This integration naturally reformulates the problem as a Maximum A Posteriori (MAP) estimation, effectively guiding learning in environments with limited data. In synthetic medical simulations, our method consistently outperforms the standard EM algorithm under both low-data and high-noise conditions. Furthermore, a case study on Myasthenia Gravis illustrates the ability of the Fuzzy MAP EM algorithm to recover a clinically coherent POMDP, demonstrating its potential as a practical tool for data-efficient modeling in healthcare.
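To make the role of pseudo-counts concrete, here is a minimal sketch of a MAP-style M-step for a single categorical distribution, such as one row of a transition matrix. It assumes the common form in which prior pseudo-counts are added to the expected counts from the E-step before normalizing. How the paper derives its fuzzy pseudo-counts from the expert model is not described in the abstract, so `pseudo_counts` is simply taken as given, and the function name `map_update` is hypothetical.

```python
def map_update(expected_counts, pseudo_counts):
    """MAP-style estimate of a categorical distribution.

    expected_counts: expected event counts from the E-step.
    pseudo_counts:   prior pseudo-counts (assumed here to be supplied by
                     an expert model; their derivation is not shown).
    """
    totals = [n + a for n, a in zip(expected_counts, pseudo_counts)]
    normalizer = sum(totals)
    if normalizer <= 0:
        raise ValueError("counts and pseudo-counts sum to zero")
    return [t / normalizer for t in totals]
```

With a single observation, `map_update([1.0, 0.0], [2.0, 2.0])` stays close to the prior, whereas `map_update([100.0, 0.0], [2.0, 2.0])` is dominated by the data. This is the sense in which pseudo-counts guide learning mainly when data are scarce.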
Self-Supervised Multisensory Pretraining for Contact-Rich Robot Reinforcement Learning
Krohn, Rickmer, Prasad, Vignesh, Tiboni, Gabriele, Chalvatzaki, Georgia
Effective contact-rich manipulation requires robots to synergistically leverage vision, force, and proprioception. However, Reinforcement Learning agents struggle to learn in such multisensory settings, especially amidst sensory noise and dynamic changes. We propose MultiSensory Dynamic Pretraining (MSDP), a novel framework for learning expressive multisensory representations tailored for task-oriented policy learning. MSDP is based on masked autoencoding and trains a transformer-based encoder by reconstructing multisensory observations from only a subset of sensor embeddings, leading to cross-modal prediction and sensor fusion. For downstream policy learning, we introduce a novel asymmetric architecture, where a cross-attention mechanism allows the critic to extract dynamic, task-specific features from the frozen embeddings, while the actor receives a stable pooled representation to guide its actions. Our method demonstrates accelerated learning and robust performance under diverse perturbations, including sensor noise and changes in object dynamics. Evaluations in multiple challenging, contact-rich robot manipulation tasks in simulation and the real world showcase the effectiveness of MSDP. Our approach exhibits strong robustness to perturbations and achieves high success rates on the real robot with as few as 6,000 online interactions, offering a simple yet powerful solution for complex multisensory robotic control.
LLM-based Agents Suffer from Hallucinations: A Survey of Taxonomy, Methods, and Directions
Lin, Xixun, Ning, Yucheng, Zhang, Jingwen, Dong, Yan, Liu, Yilong, Wu, Yongxuan, Qi, Xiaohua, Sun, Nan, Shang, Yanmin, Wang, Kun, Cao, Pengfei, Wang, Qingyue, Zou, Lixin, Chen, Xu, Zhou, Chuan, Wu, Jia, Zhang, Peng, Wen, Qingsong, Pan, Shirui, Wang, Bin, Cao, Yanan, Chen, Kai, Hu, Songlin, Guo, Li
Driven by the rapid advancements of Large Language Models (LLMs), LLM-based agents have emerged as powerful intelligent systems capable of human-like cognition, reasoning, and interaction. These agents are increasingly being deployed across diverse real-world applications, including student education, scientific research, and financial analysis. However, despite their remarkable potential, LLM-based agents remain vulnerable to hallucination issues, which can result in erroneous task execution and undermine the reliability of the overall system design. Addressing this critical challenge requires a deep understanding and a systematic consolidation of recent advances on LLM-based agents. To this end, we present the first comprehensive survey of hallucinations in LLM-based agents. By carefully analyzing the complete workflow of agents, we propose a new taxonomy that identifies different types of agent hallucinations occurring at different stages. Furthermore, we conduct an in-depth examination of eighteen triggering causes underlying the emergence of agent hallucinations. Through a detailed review of a large number of existing studies, we summarize approaches for hallucination mitigation and detection, and highlight promising directions for future research. We hope this survey will inspire further efforts toward addressing hallucinations in LLM-based agents, ultimately contributing to the development of more robust and reliable agent systems.
Cao, K. Chen, S. Hu, and L. Guo are with Institute of Information Engineering, Chinese Academy of Sciences, School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China. K. Wang is with Nanyang Technological University, Singapore. Cao is with Institute of Automation, Chinese Academy of Sciences, Beijing, China. Q. Wang is with Hong Kong University of Science and Technology, Hong Kong, China. L. Zou is with School of Cyber Science and Engineering, Wuhan University, Wuhan, China. X. Chen is with Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China. C. Zhou is with Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, China. J. Wu is with School of Computing, Faculty of Science and Engineering, Macquarie University, Sydney, Australia. Zhang is with the Cyberspace Institute of Advanced Technology, Guangzhou University, Guangzhou, China. Q. Wen is with Squirrel Ai Learning, Bellevue, USA. S. Pan is with School of Information and Communication Technology, Griffith University, Gold Coast, Australia. B. Wang is with Xiaomi Company, Beijing, China.
Skill-Aligned Fairness in Multi-Agent Learning for Collaboration in Healthcare
Ekpo, Promise Osaine, La, Brian, Wiener, Thomas, Agarwal, Saesha, Agrawal, Arshia, Gonzalez-Pumariega, Gonzalo, Molu, Lekan P., Taylor, Angelique
Fairness in multi-agent reinforcement learning (MARL) is often framed as a workload balance problem, overlooking agent expertise and the structured coordination required in real-world domains. In healthcare, equitable task allocation requires workload balance or expertise alignment to prevent burnout and overuse of highly skilled agents. Workload balance refers to distributing an approximately equal number of subtasks or equalised effort across healthcare workers, regardless of their expertise. We make two contributions to address this problem. First, we propose FairSkillMARL, a framework that defines fairness as the dual objective of workload balance and skill-task alignment. Second, we introduce MARLHospital, a customizable healthcare-inspired environment for modeling team compositions and energy-constrained scheduling impacts on fairness, as no existing simulators are well-suited for this problem. We conducted experiments to compare FairSkillMARL in conjunction with four standard MARL methods, and against two state-of-the-art fairness metrics. Our results suggest that fairness based solely on equal workload might lead to task-skill mismatches and highlight the need for more robust metrics that capture skill-task misalignment. Our work provides tools and a foundation for studying fairness in heterogeneous multi-agent systems where aligning effort with expertise is critical.
Accelerated Distributional Temporal Difference Learning with Linear Function Approximation
Jin, Kaicheng, Peng, Yang, Yang, Jiansheng, Zhang, Zhihua
In this paper, we study the finite-sample statistical rates of distributional temporal difference (TD) learning with linear function approximation. The purpose of distributional TD learning is to estimate the return distribution of a discounted Markov decision process for a given policy. Previous works on statistical analysis of distributional TD learning focus mainly on the tabular case. We first consider the linear function approximation setting and conduct a fine-grained analysis of the linear-categorical Bellman equation. Building on this analysis, we further incorporate variance reduction techniques in our new algorithms to establish tight sample complexity bounds independent of the support size $K$ when $K$ is large. Our theoretical results imply that, when employing distributional TD learning with linear function approximation, learning the full distribution of the return function from streaming data is no more difficult than learning its expectation. This work provides new insights into the statistical efficiency of distributional reinforcement learning algorithms.
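For readers unfamiliar with the categorical parametrization, the sketch below shows the standard categorical projection used in categorical distributional RL: a return distribution on $K$ equally spaced atoms is shifted and scaled by one Bellman step, then projected back onto the same atoms by linear interpolation. This is a tabular, single-transition illustration only. It is not the paper's linear-function-approximation or variance-reduced algorithm, and the function name `project_categorical` and its arguments are hypothetical.

```python
import math


def project_categorical(probs, reward, gamma, v_min, v_max):
    """Project the distribution of reward + gamma * Z onto a fixed support.

    probs[j] is the probability of atom z_j = v_min + j * delta, where the
    K = len(probs) atoms are equally spaced on [v_min, v_max].
    """
    k = len(probs)
    delta = (v_max - v_min) / (k - 1)
    projected = [0.0] * k
    for j, p in enumerate(probs):
        z = v_min + j * delta
        # One Bellman step, clipped to the range covered by the support.
        target = min(max(reward + gamma * z, v_min), v_max)
        # Fractional position of the target between two neighboring atoms.
        pos = (target - v_min) / delta
        lower = min(int(math.floor(pos)), k - 1)
        upper = min(lower + 1, k - 1)
        weight_upper = pos - lower
        # Split the mass between the two atoms so the mean is preserved.
        projected[lower] += p * (1.0 - weight_upper)
        projected[upper] += p * weight_upper
    return projected
```

For example, with atoms at 0, 1, and 2, a point mass at 1, a reward of 0.5, and `gamma = 1.0`, the target value 1.5 falls halfway between two atoms, so the mass is split evenly between atoms 1 and 2.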
Protein Secondary Structure Prediction Using 3D Graphs and Relation-Aware Message Passing Transformers
Varshney, Disha, Garg, Samarth, Tyagi, Sarthak, Varshney, Deeksha, Deep, Nayan, Ekbal, Asif
In this study, we tackle the challenging task of predicting secondary structures from protein primary sequences, a pivotal initial stride towards predicting tertiary structures, while yielding crucial insights into protein activity, relationships, and functions. Existing methods often utilize extensive sets of unlabeled amino acid sequences. However, these approaches neither explicitly capture nor harness the accessible protein 3D structural data, which is recognized as a decisive factor in dictating protein functions. To address this, we utilize protein residue graphs and introduce various forms of sequential or structural connections to capture enhanced spatial information. We combine Graph Neural Networks (GNNs) and Language Models (LMs), specifically utilizing a pre-trained transformer-based protein language model to encode amino acid sequences and employing message-passing mechanisms like GCN and R-GCN to capture geometric characteristics of protein structures. Employing convolution within a specific node's nearby region, including relations, we stack multiple convolutional layers to efficiently learn combined insights from the protein's spatial graph, revealing intricate interconnections and dependencies in its structure. To assess our model's performance, we employed the training dataset provided by NetSurfP-2.0, which outlines secondary structure in 3- and 8-states. Extensive experiments show that our proposed model, SSRGNet, surpasses the baseline on F1-scores.
Introduction
Proteins serve as essential components within cells and are involved in various applications, spanning from therapeutics to materials. They are composed of a sequence of amino acids that fold into distinct shapes. With the development of affordable sequencing technologies [1, 2], a substantial number of novel protein sequences have been identified in recent times. However, annotating the functional properties of a newly discovered protein sequence is still a laborious and expensive process. Thus, there is a need for reliable and efficient computational methods to accurately predict and assign functions to proteins, thereby bridging the gap between sequence information and functional knowledge. The analysis of protein structure, particularly the tertiary structure, is highly significant for practical applications related to proteins, such as understanding their functions and designing drugs [3].
Asymptotic analysis of cooperative censoring policies in sensor networks
Fernandez-Bes, Jesus, Arroyo-Valles, Rocío, Cid-Sueiro, Jesús
The problem of cooperative data censoring in battery-powered multihop sensor networks is analyzed in this paper. We are interested in scenarios where nodes generate messages (which are related to the sensor measurements) that can be graded with some importance value. Less important messages can be censored in order to save energy for later communications. The problem is modeled using a joint Markov Decision Process of the whole network dynamics, and a theoretically optimal censoring policy, which maximizes a long-term reward, is found. Though the optimal censoring rules are computationally prohibitive, our analysis suggests that, under some conditions, they can be approximated by a finite collection of constant-threshold rules. A centralized algorithm for the computation of these thresholds is proposed. The experimental simulations show that cooperative censoring policies are energy-efficient, and outperform other non-cooperative schemes.
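A constant-threshold censoring rule can be illustrated with a deliberately simplified sketch. It assumes a single node with a fixed energy budget and a fixed transmission cost, and it ignores the multihop cooperation, reception costs, and Markov Decision Process formulation that the paper analyzes. The function `run_threshold_policy` and its parameters are hypothetical and serve only to show why censoring low-importance messages can raise the total importance delivered.

```python
def run_threshold_policy(importances, threshold, battery, tx_cost):
    """Simulate one node that censors messages below a constant threshold.

    importances: importance values of the messages, in arrival order.
    threshold:   messages with importance below this value are censored.
    battery:     initial energy budget (assumed units).
    tx_cost:     energy spent per transmitted message (assumed constant).
    Returns (total importance sent, list of sent messages, remaining energy).
    """
    total = 0.0
    sent = []
    for importance in importances:
        # Transmit only if the message is important enough and energy remains.
        if importance >= threshold and battery >= tx_cost:
            battery -= tx_cost
            total += importance
            sent.append(importance)
    return total, sent, battery
```

With messages `[0.9, 0.1, 0.8, 0.2, 0.7]`, a battery of 2, and a cost of 1 per transmission, a threshold of 0 spends the budget on the first two messages, while a threshold of 0.5 saves energy for the second high-importance message.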