Casper: Inferring Diverse Intents for Assistive Teleoperation with Vision Language Models

Huihan Liu, Rutav Shah, Shuijing Liu, Jack Pittenger, Mingyo Seo, Yuchen Cui, Yonatan Bisk, Roberto Martín-Martín, Yuke Zhu

arXiv.org Artificial Intelligence 

Deploying robots in human-centric settings like households requires balancing robot autonomy with humans' sense of agency [1, 2, 3, 4, 5, 6]. Full teleoperation offers users fine-grained control but imposes a high cognitive load, whereas fully autonomous robots act independently but often fail to align their actions with nuanced human needs. Assistive teleoperation -- a paradigm in which the human and the robot share control [7, 8, 9, 10] -- has thus emerged as an appealing middle ground. By keeping the user in control of high-level decisions while delegating low-level actions to the autonomous robot, this approach preserves user agency while enhancing overall system performance. Assistive teleoperation is therefore a desirable paradigm for robots serving as reliable partners in human-centric environments, such as assisting individuals with motor impairments [11, 12].

While promising, assistive teleoperation in everyday environments remains challenging. A longstanding challenge is to infer human intent from user control inputs and to assist with the correct actions [8]. This challenge is amplified in real-world settings, where robots must go beyond closed-set intent prediction [13, 14] to handle diverse, open-ended user goals across different contexts and scenes. A key capability the robot must therefore possess is to interpret user control inputs within the visual context and infer intent through commonsense reasoning.
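To make this capability concrete, the minimal sketch below shows one plausible way a vision language model could be queried to pick the candidate intent that best explains recent teleoperation inputs given the current camera view. This is an illustrative assumption, not the paper's implementation: `query_vlm`, `TeleopSnapshot`, the prompt wording, and the fallback logic are all hypothetical stand-ins for whatever multimodal API and data structures a real system would use.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class TeleopSnapshot:
    """One step of teleoperation context (hypothetical structure)."""
    image_path: str                # current RGB frame from the robot camera
    control_inputs: Sequence[str]  # e.g., ["move gripper toward mug", "open gripper"]


def infer_intent(
    snapshot: TeleopSnapshot,
    candidate_intents: Sequence[str],
    query_vlm: Callable[[str, str], str],  # (image_path, prompt) -> model text reply
) -> str:
    """Ask a VLM which candidate intent best explains the user's control inputs.

    `query_vlm` is a placeholder for any multimodal model call (local VLM or
    hosted API); it is an assumption of this sketch, not part of the paper.
    """
    prompt = (
        "A human is teleoperating a robot. Recent control inputs: "
        f"{list(snapshot.control_inputs)}. "
        "Given the attached camera image, which of these intents best "
        f"explains the inputs? Options: {list(candidate_intents)}. "
        "Reply with exactly one option."
    )
    reply = query_vlm(snapshot.image_path, prompt).strip().lower()
    # Match the reply against the option list; fall back to the first
    # candidate if the model's text does not name any option.
    for intent in candidate_intents:
        if intent.lower() in reply:
            return intent
    return candidate_intents[0]
```

In a full system, the candidate intents would themselves be generated open-endedly from the scene rather than fixed in advance, and the inferred intent would be re-estimated as new control inputs arrive.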
