Markov Models
Enhancing Large Language Models for Mobility Analytics with Semantic Location Tokenization
Chen, Yile, Tao, Yicheng, Jiang, Yue, Liu, Shuai, Yu, Han, Cong, Gao
The widespread adoption of location-based services has led to the generation of vast amounts of mobility data, providing significant opportunities to model user movement dynamics within urban environments. Recent advancements have focused on adapting Large Language Models (LLMs) for mobility analytics. However, existing methods face two primary limitations: inadequate semantic representation of locations (i.e., discrete IDs) and insufficient modeling of mobility signals within LLMs (i.e., single templated instruction fine-tuning). To address these issues, we propose QT-Mob, a novel framework that significantly enhances LLMs for mobility analytics. QT-Mob introduces a location tokenization module that learns compact, semantically rich tokens to represent locations, preserving contextual information while ensuring compatibility with LLMs. Furthermore, QT-Mob incorporates a series of complementary fine-tuning objectives that align the learned tokens with the internal representations in LLMs, improving the model's comprehension of sequential movement patterns and location semantics. The proposed QT-Mob framework not only enhances LLMs' ability to interpret mobility data but also provides a more generalizable approach for various mobility analytics tasks. Experiments on three real-world dataset demonstrate the superior performance in both next-location prediction and mobility recovery tasks, outperforming existing deep learning and LLM-based methods.
Runtime Safety through Adaptive Shielding: From Hidden Parameter Inference to Provable Guarantees
Kwon, Minjae, Ingebrand, Tyler, Topcu, Ufuk, Feng, Lu
Variations in hidden parameters, such as a robot's mass distribution or friction, pose safety risks during execution. We develop a runtime shielding mechanism for reinforcement learning, building on the formalism of constrained hidden-parameter Markov decision processes. Function encoders enable real-time inference of hidden parameters from observations, allowing the shield and the underlying policy to adapt online. The shield constrains the action space by forecasting future safety risks (such as obstacle proximity) and accounts for uncertainty via conformal prediction. We prove that the proposed mechanism satisfies probabilistic safety guarantees and yields optimal policies among the set of safety-compliant policies. Experiments across diverse environments with varying hidden parameters show that our method significantly reduces safety violations and achieves strong out-of-distribution generalization, while incurring minimal runtime overhead.
Task-Driven Discrete Representation Learning
In recent years, deep discrete representation learning (DRL) has achieved significant success across various domains. Most DRL frameworks (e.g., the widely used VQ-VAE and its variants) have primarily focused on generative settings, where the quality of a representation is implicitly gauged by the fidelity of its generation. In fact, the goodness of a discrete representation remain ambiguously defined across the literature. In this work, we adopt a practical approach that examines DRL from a task-driven perspective. We propose a unified framework that explores the usefulness of discrete features in relation to downstream tasks, with generation naturally viewed as one possible application. In this context, the properties of discrete representations as well as the way they benefit certain tasks are also relatively understudied. We therefore provide an additional theoretical analysis of the trade-off between representational capacity and sample complexity, shedding light on how discrete representation utilization impacts task performance. Finally, we demonstrate the flexibility and effectiveness of our framework across diverse applications.
Shapley Machine: A Game-Theoretic Framework for N-Agent Ad Hoc Teamwork
Wang, Jianhong, Li, Yang, Kaski, Samuel, Lawry, Jonathan
Open multi-agent systems are increasingly important in modeling real-world applications, such as smart grids, swarm robotics, etc. In this paper, we aim to investigate a recently proposed problem for open multi-agent systems, referred to as n-agent ad hoc teamwork (NAHT), where only a number of agents are controlled. Existing methods tend to be based on heuristic design and consequently lack theoretical rigor and ambiguous credit assignment among agents. To address these limitations, we model and solve NAHT through the lens of cooperative game theory. More specifically, we first model an open multi-agent system, characterized by its value, as an instance situated in a space of cooperative games, generated by a set of basis games. We then extend this space, along with the state space, to accommodate dynamic scenarios, thereby characterizing NAHT. Exploiting the justifiable assumption that basis game values correspond to a sequence of n-step returns with different horizons, we represent the state values for NAHT in a form similar to $ฮป$-returns. Furthermore, we derive Shapley values to allocate state values to the controlled agents, as credits for their contributions to the ad hoc team. Different from the conventional approach to shaping Shapley values in an explicit form, we shape Shapley values by fulfilling the three axioms uniquely describing them, well defined on the extended game space describing NAHT. To estimate Shapley values in dynamic scenarios, we propose a TD($ฮป$)-like algorithm. The resulting reinforcement learning (RL) algorithm is referred to as Shapley Machine. To our best knowledge, this is the first time that the concepts from cooperative game theory are directly related to RL concepts. In experiments, we demonstrate the effectiveness of Shapley Machine and verify reasonableness of our theory.
What Exactly Does Guidance Do in Masked Discrete Diffusion Models
Ye, He, Kevin, Rojas, Molei, Tao
Diffusion models have become an influential tool for generative modeling, offering a flexible framework that performs well across a range of data types including images, audio, and text (Dhariwal and Nichol, 2021; Kong et al., 2021; Li et al., 2022; Ho et al., 2022). Originally formulated in continuous state spaces (Ho et al., 2020; Song et al., 2021), these models simulate a forward noising process--typically modeled by a stochastic differential equation and learn a reverse process to denoise and reconstruct the original data. More recently, the diffusion framework has been extended to discrete state spaces (Campbell et al., 2022; Lou et al., 2023), where the forward process is defined via a continuous-time Markov chain over a finite state space. This has enabled generative modeling for discrete domains such as language modeling, molecule generation, and protein design (Lou et al., 2023; Nie et al.; Huang et al., 2023; Gruver et al., 2023). A key innovation that has enhanced the performance and flexibility of diffusion models is guidance, which introduces an auxiliary parameter to steer the reverse process toward desired outputs. In the continuous setting, classifier guidance (Dhariwal and Nichol, 2021) and classifier-free guidance (Ho and Salimans, 2021; Nichol et al., 2022) are widely used for conditional generation based on class labels or text prompts, significantly improving sample quality and alignment with conditioning signals. This technique has been critical to the success of models such as GLIDE (Nichol et al., 2022) and Imagen (Saharia et al., 2022). Theoretical analyses of guided diffusion models in continuous state spaces have examined how guidance modifies the reverse dynamics, most of which focus on simple settings such as low-dimensional and mixture of Gaussian models (Bradley and Nakkiran, 2024; Wu et al., 2024; Chidambaram et al., 2024). Classifier-free guidance (CFG) has also been recently introduced to discrete diffusion models, for applications such as text generation and controlled molecule design (Huang et al., 2023; Nisonoff et al., 2024).
Grounded Vision-Language Navigation for UAVs with Open-Vocabulary Goal Understanding
Zhang, Yuhang, Yu, Haosheng, Xiao, Jiaping, Feroskhan, Mir
--Vision-and-language navigation (VLN) is a long-standing challenge in autonomous robotics, aiming to empower agents with the ability to follow human instructions while navigating complex environments. Two key bottlenecks remain in this field: generalization to out-of-distribution environments and reliance on fixed discrete action spaces. To address these challenges, we propose Vision-Language Fly (VLFly), a framework tailored for Unmanned Aerial Vehicles (UAVs) to execute language-guided flight. Without the requirement for localization or active ranging sensors, VLFly outputs continuous velocity commands purely from egocentric observations captured by an onboard monocular camera. The VLFly integrates three modules: an instruction encoder based on a large language model (LLM) that reformulates high-level language into structured prompts, a goal retriever powered by a vision-language model (VLM) that matches these prompts to goal images via vision-language similarity, and a waypoint planner that generates executable trajectories for real-time UAV control. VLFly is evaluated across diverse simulation environments without additional fine-tuning and consistently outperforms all baselines. Moreover, real-world VLN tasks in indoor and outdoor environments under direct and indirect instructions demonstrate that VLFly achieves robust open-vocabulary goal understanding and generalized navigation capabilities, even in the presence of abstract language input. The capability of robots to execute complex tasks based on natural language instructions has long been a compelling objective in robotics and artificial intelligence (AI). Particularly within the field of autonomous navigation, this capability enables applications spanning home assistance [1], urban inspection [2], and environmental exploration [3]. Y uhang Zhang and Haosheng Y u are co-first authors.
Immersive Multimedia Communication: State-of-the-Art on eXtended Reality Streaming
Wang, Haopeng, Dong, Haiwei, Saddik, Abdulmotaleb El
Extended reality (XR) is rapidly advancing, and poised to revolutionize content creation and consumption. In XR, users integrate various sensory inputs to form a cohesive perception of the virtual environment. This survey reviews the state-of-the-art in XR streaming, focusing on multiple paradigms. To begin, we define XR and introduce various XR headsets along with their multimodal interaction methods to provide a foundational understanding. We then analyze XR traffic characteristics to highlight the unique data transmission requirements. We also explore factors that influence the quality of experience in XR systems, aiming to identify key elements for enhancing user satisfaction. Following this, we present visual attention-based optimization methods for XR streaming to improve efficiency and performance. Finally, we examine current applications and highlight challenges to provide insights into ongoing and future developments of XR.
MoRE: Mixture of Residual Experts for Humanoid Lifelike Gaits Learning on Complex Terrains
Wang, Dewei, Wang, Xinmiao, Liu, Xinzhe, Shi, Jiyuan, Zhao, Yingnan, Bai, Chenjia, Li, Xuelong
--Humanoid robots have demonstrated robust locomotion capabilities using Reinforcement Learning (RL)-based approaches. Further, to obtain human-like behaviors, existing methods integrate human motion-tracking or motion prior in the RL framework. However, these methods are limited in flat terrains with proprioception only, restricting their abilities to traverse challenging terrains with human-like gaits. In this work, we propose a novel framework using a mixture of latent residual experts with multi-discriminators to train an RL policy, which is capable of traversing complex terrains in controllable lifelike gaits with exteroception. Our two-stage training pipeline first teaches the policy to traverse complex terrains using a depth camera, and then enables gait-commanded switching between human-like gait patterns. We also design gait rewards to adjust human-like behaviors like robot base height. Simulation and real-world experiments demonstrate that our framework exhibits exceptional performance in traversing complex terrains, and achieves seamless transitions between multiple human-like gait patterns.
Active inference as a unified model of collision avoidance behavior in human drivers
Schumann, Julian F., Engstrรถm, Johan, Johnson, Leif, O'Kelly, Matthew, Messias, Joao, Kober, Jens, Zgonnikov, Arkady
Collision avoidance -- involving a rapid threat detection and quick execution of the appropriate evasive maneuver -- is a critical aspect of driving. However, existing models of human collision avoidance behavior are fragmented, focusing on specific scenarios or only describing certain aspects of the avoidance behavior, such as response times. This paper addresses these gaps by proposing a novel computational cognitive model of human collision avoidance behavior based on active inference. Active inference provides a unified approach to modeling human behavior: the minimization of free energy. Building on prior active inference work, our model incorporates established cognitive mechanisms such as evidence accumulation to simulate human responses in two distinct collision avoidance scenarios: front-to-rear lead vehicle braking and lateral incursion by an oncoming vehicle. We demonstrate that our model explains a wide range of previous empirical findings on human collision avoidance behavior. Specifically, the model closely reproduces both aggregate results from meta-analyses previously reported in the literature and detailed, scenario-specific effects observed in a recent driving simulator study, including response timing, maneuver selection, and execution. Our results highlight the potential of active inference as a unified framework for understanding and modeling human behavior in complex real-life driving tasks.
Modeling Trust Dynamics in Robot-Assisted Delivery: Impact of Trust Repair Strategies
Mangalindan, Dong Hae, Kandikonda, Karthik, Rovira, Ericka, Srivastava, Vaibhav
With increasing efficiency and reliability, autonomous systems are becoming valuable assistants to humans in various tasks. In the context of robot-assisted delivery, we investigate how robot performance and trust repair strategies impact human trust. In this task, while handling a secondary task, humans can choose to either send the robot to deliver autonomously or manually control it. The trust repair strategies examined include short and long explanations, apology and promise, and denial. Using data from human participants, we model human behavior using an Input-Output Hidden Markov Model (IOHMM) to capture the dynamics of trust and human action probabilities. Our findings indicate that humans are more likely to deploy the robot autonomously when their trust is high. Furthermore, state transition estimates show that long explanations are the most effective at repairing trust following a failure, while denial is most effective at preventing trust loss. We also demonstrate that the trust estimates generated by our model are isomorphic to self-reported trust values, making them interpretable. This model lays the groundwork for developing optimal policies that facilitate real-time adjustment of human trust in autonomous systems.