NOIR 2.0: Neural Signal Operated Intelligent Robots for Everyday Activities

Kim, Tasha, Wang, Yingke, Cho, Hanvit, Hodges, Alex

arXiv.org Artificial Intelligence

Brain-robot interfaces (BRIs) represent a major milestone in the fields of art, science, and engineering. Neural Signal Operated Intelligent Robots (NOIR) [1], unveiled in 2023, is a versatile, intelligent BRI system that employs non-invasive electroencephalography (EEG). The system operates on the concept of hierarchical shared autonomy: humans set high-level objectives, and the robot carries them out by executing detailed motor commands. At its introduction, NOIR demonstrated its general-purpose nature by handling a variety of tasks (20 everyday activities) and its broad accessibility, requiring minimal training for use by the general public. Moreover, NOIR is adaptive and intelligent, equipped with a broad set of skills that enable it to autonomously perform low-level actions. Human intentions are conveyed, interpreted, and executed by the robots through parameterized primitive skills, such as Pick(obj-A) or MoveTo(x,y).
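The shared-autonomy loop built on parameterized primitives could be sketched as below. This is a minimal illustration in the spirit of NOIR's Pick(obj-A) / MoveTo(x,y) skills; all class and function names are assumptions for illustration, not NOIR's actual API.

```python
# Hypothetical sketch: a registry of parameterized primitive skills. In
# NOIR the human selects the skill and its parameters via EEG decoding;
# the robot then executes the corresponding low-level motor routine.
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

@dataclass
class RobotState:
    position: Tuple[float, float] = (0.0, 0.0)
    holding: Optional[str] = None
    log: List[str] = field(default_factory=list)

def move_to(state: RobotState, x: float, y: float) -> None:
    """Low-level skill: drive the end effector to (x, y)."""
    state.position = (x, y)
    state.log.append("MoveTo(%s, %s)" % (x, y))

def pick(state: RobotState, obj: str) -> None:
    """Low-level skill: grasp the named object at the current position."""
    state.holding = obj
    state.log.append("Pick(%s)" % obj)

SKILLS: Dict[str, Callable] = {"MoveTo": move_to, "Pick": pick}

def execute(state: RobotState, skill: str, *params) -> None:
    """Dispatch a decoded high-level objective to its primitive skill."""
    SKILLS[skill](state, *params)

robot = RobotState()
execute(robot, "MoveTo", 0.4, 0.2)
execute(robot, "Pick", "obj-A")
```

The registry pattern keeps the human-facing vocabulary (skill names and parameters) decoupled from how each primitive is implemented on the robot.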


COOPERA: Continual Open-Ended Human-Robot Assistance

Ma, Chenyang, Lu, Kai, Desai, Ruta, Puig, Xavier, Markham, Andrew, Trigoni, Niki

arXiv.org Artificial Intelligence

To understand and collaborate with humans, robots must account for individual human traits, habits, and activities over time. However, most robotic assistants lack these abilities, as they primarily focus on predefined tasks in structured environments and lack a human model to learn from. This work introduces COOPERA, a novel framework for COntinual, OPen-Ended human-Robot Assistance, where simulated humans, driven by psychological traits and long-term intentions, interact with robots in complex environments. By integrating continuous human feedback, our framework, for the first time, enables the study of long-term, open-ended human-robot collaboration (HRC) in different collaborative tasks across various time-scales. Within COOPERA, we introduce a benchmark and an approach to personalize the robot's collaborative actions by learning human traits and context-dependent intents. Experiments validate the extent to which our simulated humans reflect realistic human behaviors and demonstrate the value of inferring and personalizing to human intents for open-ended and long-term HRC. Project Page: https://dannymcy.github.io/coopera/
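The idea of a trait-driven simulated human paired with a robot that continually re-estimates that trait can be sketched in a few lines. The trait, action set, and estimator below are toy assumptions for illustration, not COOPERA's simulation or learning code.

```python
# Hypothetical sketch: a simulated human whose action choices are biased
# by a single psychological trait ("tidiness"), and a robot-side estimator
# that continually refines its belief about that trait from observations.
import random
from typing import List

def simulate_choice(tidiness: float, rng: random.Random) -> str:
    """A tidier simulated human is more likely to clean before cooking."""
    return "clean" if rng.random() < tidiness else "cook"

def estimate_trait(observations: List[str]) -> float:
    """Continual estimate: fraction of 'clean' choices observed so far."""
    return observations.count("clean") / len(observations)

rng = random.Random(0)  # seeded for reproducibility
history = [simulate_choice(0.8, rng) for _ in range(500)]
estimate = estimate_trait(history)  # converges toward the true trait 0.8
```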


Quick on the Uptake: Eliciting Implicit Intents from Human Demonstrations for Personalized Mobile-Use Agents

Wu, Zheng, Huang, Heyuan, Yang, Yanjia, Song, Yuanyi, Lou, Xingyu, Liu, Weiwen, Zhang, Weinan, Wang, Jun, Zhang, Zhuosheng

arXiv.org Artificial Intelligence

As multimodal large language models advance rapidly, the automation of mobile tasks has become increasingly feasible through mobile-use agents that mimic human interactions with graphical user interfaces. To further enhance mobile-use agents, previous studies employ demonstration learning to improve them from human demonstrations. However, these methods focus solely on the explicit intention flows of humans (e.g., step sequences) while neglecting implicit intention flows (e.g., personal preferences), which makes it difficult to construct personalized mobile-use agents. In this work, to evaluate the Intention Alignment Rate between mobile-use agents and humans, we first collect MobileIAR, a dataset containing human-intent-aligned actions and ground-truth actions. This enables a comprehensive assessment of the agents' understanding of human intent. We then propose IFRAgent, a framework built upon Intention Flow Recognition from human demonstrations. IFRAgent analyzes explicit intention flows from human demonstrations to construct a query-level vector library of standard operating procedures (SOPs), and analyzes implicit intention flows to build a user-level habit repository. IFRAgent then leverages an SOP extractor combined with retrieval-augmented generation and a query rewriter to generate a personalized query and SOP from a raw, ambiguous query, enhancing the alignment between mobile-use agents and human intent. Experimental results demonstrate that IFRAgent outperforms baselines by an average of 6.79% (32.06% relative improvement) in human intention alignment rate and improves step completion rates by an average of 5.30% (26.34% relative improvement).
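The retrieve-and-rewrite step described above can be sketched as follows: match a raw query against a query-level SOP library by similarity, then personalize it from a user-level habit repository. The toy library, habits, and bag-of-words cosine similarity are assumptions for illustration, not IFRAgent's implementation (which uses a learned vector library and retrieval-augmented generation).

```python
# Hypothetical sketch of SOP retrieval plus habit-based query rewriting.
import math
from collections import Counter

def cosine(a: str, b: str) -> float:
    """Bag-of-words cosine similarity; a toy stand-in for embeddings."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

SOP_LIBRARY = {  # query-level library of standard operating procedures
    "order coffee in a delivery app": ["open app", "search coffee", "add to cart", "pay"],
    "book a taxi": ["open app", "set destination", "confirm pickup"],
}
HABITS = {"coffee": "oat-milk latte, no sugar"}  # user-level implicit preferences

def retrieve_and_rewrite(query: str):
    """Retrieve the closest SOP and rewrite the query with matching habits."""
    best = max(SOP_LIBRARY, key=lambda k: cosine(query, k))
    prefs = [v for k, v in HABITS.items() if k in query.lower()]
    rewritten = query + (" (" + "; ".join(prefs) + ")" if prefs else "")
    return rewritten, SOP_LIBRARY[best]

rewritten, sop = retrieve_and_rewrite("order me a coffee")
```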


IDAGC: Adaptive Generalized Human-Robot Collaboration via Human Intent Estimation and Multimodal Policy Learning

Liu, Haotian, Tong, Yuchuang, Liu, Guanchen, Ju, Zhaojie, Zhang, Zhengtao

arXiv.org Artificial Intelligence

In Human-Robot Collaboration (HRC), which encompasses physical interaction and remote cooperation, accurately estimating human intentions and seamlessly switching collaboration modes to adjust robot behavior remain paramount challenges. To address these issues, we propose an Intent-Driven Adaptive Generalized Collaboration (IDAGC) framework that leverages multimodal data and human intent estimation to facilitate adaptive policy learning across multiple tasks in diverse scenarios, enabling autonomous inference of collaboration modes and dynamic adjustment of robotic actions. This framework overcomes the limitations of existing HRC methods, which are typically restricted to a single collaboration mode and lack the capacity to identify and transition between diverse states. Central to our framework is a predictive model that captures the interdependencies among vision, language, force, and robot state data to accurately recognize human intentions with a Conditional Variational Autoencoder (CVAE) and automatically switch collaboration modes. By employing dedicated encoders for each modality and integrating the extracted features through a Transformer decoder, the framework efficiently learns multi-task policies, while force data optimizes compliance control and intent estimation accuracy during physical interactions. Experiments highlight our framework's practical potential to advance the comprehensive development of HRC. Human-Robot Collaboration plays a critical role in manufacturing, healthcare, and services [1]-[3], necessitating that robots seamlessly collaborate by accurately estimating human intentions and dynamically adapting to evolving tasks and environments, thereby mitigating the cognitive and physical burdens of human operators. Contemporary HRC comprises physical Human-Robot Interaction (pHRI) and remote cooperation, as shown in Figure 1.
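A toy late-fusion version of the two ideas above can be sketched as: fuse per-modality intent evidence into a posterior, then pick a collaboration mode from that posterior plus sensed contact force. The intent set, logit values, and the 2 N force threshold are assumptions for illustration, not IDAGC's learned model.

```python
# Hypothetical sketch: multimodal intent fusion and mode switching.
import math

INTENTS = ["handover", "co-carry", "hold-still"]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_intent(vision_logits, language_logits, force_logits):
    """Average per-modality logits (a stand-in for the CVAE-based
    predictive model), then normalize to an intent posterior."""
    fused = [(v + l + f) / 3.0
             for v, l, f in zip(vision_logits, language_logits, force_logits)]
    return dict(zip(INTENTS, softmax(fused)))

def select_mode(posterior, contact_force_n):
    """Sustained contact force implies physical interaction (pHRI);
    otherwise the robot stays in remote cooperation."""
    intent = max(posterior, key=posterior.get)
    mode = "physical" if contact_force_n > 2.0 else "remote"
    return mode, intent

posterior = fuse_intent([2.0, 0.1, -1.0], [1.5, 0.3, 0.0], [2.5, -0.5, -1.0])
mode, intent = select_mode(posterior, contact_force_n=4.2)
```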


Hierarchical Intention Tracking with Switching Trees for Real-Time Adaptation to Dynamic Human Intentions during Collaboration

Huang, Zhe, Mun, Ye-Ji, Pouria, Fatemeh Cheraghi, Driggs-Campbell, Katherine

arXiv.org Artificial Intelligence

During collaborative tasks, human behavior is guided by multiple levels of intentions that evolve over time, such as task sequence preferences and interaction strategies. To adapt to these changing preferences and promptly correct any inaccurate estimations, collaborative robots must accurately track these dynamic human intentions in real time. We propose a Hierarchical Intention Tracking (HIT) algorithm for collaborative robots to track dynamic and hierarchical human intentions effectively in real time. HIT represents human intentions as intention trees with arbitrary depth, and probabilistically tracks human intentions by Bayesian filtering, upward measurement propagation, and downward posterior propagation across all levels. We develop a HIT-based robotic system that dynamically switches between Interaction-Task and Verification-Task trees for a collaborative assembly task, allowing the robot to effectively coordinate human intentions at three levels: task-level (subtask goal locations), interaction-level (mode of engagement with the robot), and verification-level (confirming or correcting intention recognition). Our user study shows that our HIT-based collaborative robot system surpasses existing collaborative robot solutions by achieving a balance between efficiency, physical workload, and user comfort while ensuring safety and task completion. Post-experiment surveys further reveal that the HIT-based system enhances user trust and minimizes interruptions to the user's task flow through its effective understanding of human intentions across multiple levels. The video demonstrating our experiments is available at https://youtu.be/Y5kg7QC41yw. Robots require an effective understanding of human intentions to collaborate both safely and efficiently with humans. During long-term tasks, human intentions continuously evolve along with task progress.
When handling a complex task, humans typically break it down into milestones and sub-tasks at varying levels of granularity, leading to a hierarchical structure of human intentions. During collaboration, humans often maintain multiple intentions with different semantics simultaneously; for instance, they may prefer specific subtask sequences or particular modes of interaction with the robot.
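The upward/downward propagation described above can be sketched on a two-level tree: a task-level intent (which goal location) with interaction-level children (engagement mode). Upward, child likelihoods are marginalized into a parent likelihood; downward, the parent posterior rescales each child's posterior. The tree, numbers, and update rules are illustrative assumptions, not HIT's implementation.

```python
# Hypothetical two-level sketch of hierarchical Bayesian intention tracking.

def normalize(d):
    s = sum(d.values())
    return {k: v / s for k, v in d.items()}

def bayes_update(prior, likelihood):
    return normalize({k: prior[k] * likelihood[k] for k in prior})

parent_prior = {"goal_A": 0.5, "goal_B": 0.5}
child_prior = {g: {"lead": 0.5, "follow": 0.5} for g in parent_prior}

# Observation likelihoods at the leaf (interaction) level.
child_lik = {"goal_A": {"lead": 0.9, "follow": 0.2},
             "goal_B": {"lead": 0.1, "follow": 0.3}}

# Upward measurement propagation: marginalize children into a parent likelihood.
parent_lik = {g: sum(child_prior[g][m] * child_lik[g][m] for m in child_prior[g])
              for g in parent_prior}
parent_post = bayes_update(parent_prior, parent_lik)

# Downward posterior propagation: per-branch Bayes update, scaled by the parent.
child_post = {g: {m: parent_post[g] * p
                  for m, p in bayes_update(child_prior[g], child_lik[g]).items()}
              for g in parent_prior}
```

With these numbers, evidence for "lead" behavior near goal A raises both the goal_A posterior and, within it, the "lead" engagement mode, while all leaf posteriors still sum to one.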


DTRT: Enhancing Human Intent Estimation and Role Allocation for Physical Human-Robot Collaboration

Liu, Haotian, Tong, Yuchuang, Zhang, Zhengtao

arXiv.org Artificial Intelligence

In physical Human-Robot Collaboration (pHRC), accurate human intent estimation and rational human-robot role allocation are crucial for safe and efficient assistance. Existing methods that rely on short-term motion data for intention estimation lack multi-step prediction capabilities, hindering their ability to sense intent changes and adjust human-robot assignments autonomously, resulting in potential discrepancies. To address these issues, we propose a Dual Transformer-based Robot Trajectron (DTRT) featuring a hierarchical architecture, which harnesses human-guided motion and force data to rapidly capture human intent changes, enabling accurate trajectory predictions and dynamic robot behavior adjustments for effective collaboration. Specifically, human intent estimation in DTRT uses two Transformer-based Conditional Variational Autoencoders (CVAEs), incorporating robot motion data in the obstacle-free case and human-guided trajectory and force data for obstacle avoidance. Additionally, Differential Cooperative Game Theory (DCGT) is employed to synthesize predictions based on human-applied forces, ensuring that robot behavior aligns with human intention. Compared to state-of-the-art (SOTA) methods, DTRT incorporates human dynamics into long-term prediction, providing an accurate understanding of intention and enabling rational role allocation, achieving robot autonomy and maneuverability. Experiments demonstrate DTRT's accurate intent estimation and superior collaboration performance.
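A toy version of force-based role allocation in the spirit described above: the larger the human-applied force, the more authority shifts toward the human-guided trajectory over the robot's predicted one. The smooth blending rule below is an illustrative stand-in for the game-theoretic (DCGT) synthesis, not DTRT's actual formulation.

```python
# Hypothetical sketch: force magnitude sets a human-authority weight that
# blends the robot's predicted waypoint with the human-guided one.
def human_authority(force_norm_n: float, k: float = 0.5) -> float:
    """Map force magnitude (newtons) to a human-authority weight in [0, 1)."""
    return force_norm_n / (force_norm_n + k) if force_norm_n > 0 else 0.0

def blend_waypoint(robot_wp, human_wp, force_norm_n):
    """Convex combination of the robot's and human's next waypoints."""
    a = human_authority(force_norm_n)
    return tuple((1 - a) * r + a * h for r, h in zip(robot_wp, human_wp))

# With zero force the robot follows its own prediction; under force it yields.
autonomous = blend_waypoint((0.0, 0.0), (1.0, 1.0), 0.0)
guided = blend_waypoint((0.0, 0.0), (1.0, 1.0), 0.5)
```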


How Well Can Vision-Language Models Understand Humans' Intention? An Open-ended Theory of Mind Question Evaluation Benchmark

Wen, Ximing, Mainali, Mallika, Sen, Anik

arXiv.org Artificial Intelligence

Understanding human intentions through visual cues is a fundamental aspect of social intelligence, enabling effective communication, collaboration, and interaction [2]. This capability, often referred to as Theory of Mind (ToM), involves inferring the beliefs, desires, and intentions of others based on observable behaviors and environmental contexts [9, 7, 12]. Recent advances in vision-language models (VLMs) have demonstrated impressive abilities in multimodal reasoning, combining visual and textual information to perform complex tasks [5, 10, 13]. However, their capability to perform ToM-like reasoning, specifically interpreting intentions from visual cues, remains underexplored. For example, Etesam et al. [4] investigate only the emotional component of ToM rather than broader categories such as intentions. Jin et al. [6] frame the ToM task as a binary choice question, without requiring VLMs to engage in open-ended reasoning; consequently, that approach may not fully capture VLMs' capability to perform ToM tasks. Moreover, ToM tasks present unique challenges for VLMs, requiring both visual feature extraction and contextual reasoning to infer hidden mental states. Thus, our study, which evaluates VLM performance on ToM tasks through an open-ended question framework, is pivotal to assessing VLMs' capacity for advanced multimodal understanding and social intelligence.
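Scoring open-ended answers is the part that distinguishes this setup from binary-choice evaluation, and it can be sketched with a deliberately simple metric: keyword recall of a free-form answer against reference intent keywords. This stand-in metric is an assumption for illustration; the benchmark itself may rely on human or LLM judges.

```python
# Hypothetical sketch: grade a free-form ToM answer by the fraction of
# reference intent keywords it mentions, instead of a binary choice.
def intent_recall(answer: str, reference_keywords) -> float:
    """Fraction of reference intent keywords present in the answer."""
    tokens = set(answer.lower().replace(".", "").replace(",", "").split())
    hits = sum(1 for kw in reference_keywords if kw in tokens)
    return hits / len(reference_keywords)

score = intent_recall(
    "She is reaching for the door because she wants to leave.",
    {"door", "leave"},
)
```

A recall-style score rewards answers that identify the intended goal even when phrased freely, which a multiple-choice accuracy metric cannot measure.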


Toward Aligning Human and Robot Actions via Multi-Modal Demonstration Learning

Zahid, Azizul, Fan, Jie, Wang, Farong, Dy, Ashton, Swaminathan, Sai, Liu, Fei

arXiv.org Artificial Intelligence

Understanding action correspondence between humans and robots is essential for evaluating alignment in decision-making, particularly in human-robot collaboration and imitation learning within unstructured environments. We propose a multimodal demonstration learning framework that explicitly models human demonstrations from RGB video alongside robot demonstrations in a voxelized RGB-D space. Focusing on the "pick and place" task from the RH20T dataset, we utilize data from 5 users across 10 diverse scenes. Our approach combines ResNet-based visual encoding for human intention modeling with a Perceiver Transformer for voxel-based robot action prediction. After 2000 training epochs, the human model reaches 71.67% accuracy and the robot model achieves 71.8% accuracy, demonstrating the framework's potential for aligning complex, multimodal human and robot behaviors in manipulation tasks.


Learning Long Short-Term Intention within Human Daily Behaviors

Sun, Zhe, Wu, Rujie, Yang, Xiaodong, Xie, Hongzhao, Jiang, Haiyan, Bi, Junda, Zhang, Zhenliang

arXiv.org Artificial Intelligence

In the domain of autonomous household robots, it is of utmost importance for robots to understand human behaviors and provide appropriate services. This requires robots to analyze complex human behaviors and predict the true intentions of humans. Traditionally, humans are perceived as flawless, with their decisions acting as the standards that robots should strive to align with. However, this raises a pertinent question: what if humans make mistakes? In this research, we present a unique task, termed "long short-term intention prediction". This task requires robots to predict both the long-term intention of humans, which aligns with human values, and the short-term intention of humans, which reflects the immediate action intention. Meanwhile, the robots need to detect potential inconsistencies between the short-term and long-term intentions, and provide necessary warnings and suggestions. To facilitate this task, we propose a long short-term intention model to represent the complex intention states and build a dataset to train this intention model. We then propose a two-stage method to integrate the intention model into robots: i) predicting human intentions, both value-based long-term intentions and action-based short-term intentions; and ii) analyzing the consistency between the long-term and short-term intentions. Experimental results indicate that the proposed long short-term intention model can assist robots in comprehending human behavioral patterns over both long-term and short-term durations, which helps determine the consistency between the long-term and short-term intentions of humans.
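The second stage, the consistency check between a value-based long-term intention and an action-based short-term intention, can be sketched directly. The goal-to-action mapping below is a toy assumption for illustration, not the paper's learned intention model.

```python
# Hypothetical sketch: flag short-term actions that conflict with the
# inferred long-term (value-based) intention and emit a warning.
LONG_TERM_SUPPORTS = {
    "healthy_diet": {"wash_vegetables", "cook_salad", "drink_water"},
    "save_energy": {"turn_off_lights", "lower_thermostat"},
}

def check_consistency(long_term: str, short_term_action: str):
    """Return (consistent, warning); warning is None when consistent."""
    consistent = short_term_action in LONG_TERM_SUPPORTS.get(long_term, set())
    warning = None if consistent else (
        "'%s' may conflict with goal '%s'" % (short_term_action, long_term))
    return consistent, warning
```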


Enhancing Context-Aware Human Motion Prediction for Efficient Robot Handovers

Gómez-Izquierdo, Gerard, Laplaza, Javier, Sanfeliu, Alberto, Garrell, Anaís

arXiv.org Artificial Intelligence

Accurate human motion prediction (HMP) is critical for seamless human-robot collaboration, particularly in handover tasks that require real-time adaptability. In this work, we enhance human motion forecasting for handover tasks by leveraging siMLPe [1], a lightweight yet powerful architecture, and introducing key improvements. Our approach, named IntentMotion, incorporates intention-aware conditioning, task-specific loss functions, and a novel intention classifier, significantly improving motion prediction accuracy while maintaining efficiency. Experimental results demonstrate that our method reduces body loss error by over 50%, achieves 200× faster inference, and requires only 3% of the parameters of existing state-of-the-art HMP models. These advancements establish our framework as a highly efficient and scalable solution for real-time human-robot interaction. Human motion prediction plays a crucial role in human-robot collaboration (HRC) by enabling robots to anticipate human movements and respond proactively. This capability is particularly important in handover tasks, where the seamless exchange of objects between humans and robots requires both accuracy and speed. The ability to predict human motion allows robots to preemptively adjust their trajectories, improving efficiency and ensuring safety. In this context, human intention, whether the motion is collaborative or non-collaborative, directly influences the prediction and subsequent robot response.
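The basic mechanism that "intention-aware conditioning" names can be sketched as: append a one-hot intention code to each motion-feature frame before it enters the predictor. The class set and feature sizes below are illustrative assumptions, not IntentMotion's actual configuration.

```python
# Hypothetical sketch: condition motion features on a predicted intention
# class by concatenating a one-hot code onto every frame's feature vector.
INTENT_CLASSES = ["collaborative", "non-collaborative"]

def one_hot(intent: str):
    return [1.0 if c == intent else 0.0 for c in INTENT_CLASSES]

def condition_frames(frames, intent: str):
    """frames: list of per-frame joint feature vectors; returns conditioned copies."""
    code = one_hot(intent)
    return [list(f) + code for f in frames]

conditioned = condition_frames([[0.1, 0.2], [0.3, 0.4]], "collaborative")
```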