Take That for Me: Multimodal Exophora Resolution with Interactive Questioning for Ambiguous Out-of-View Instructions
Akira Oyama, Shoichi Hasegawa, Akira Taniguchi, Yoshinobu Hagiwara, Tadahiro Taniguchi
arXiv.org Artificial Intelligence
Daily life support robots must interpret ambiguous verbal instructions involving demonstratives, such as "Bring me that cup," even when objects or users are out of the robot's view. Existing approaches to exophora resolution rely primarily on visual data and thus fail in real-world scenarios where the object or user is not visible. We propose Multimodal Interactive Exophora resolution with user Localization (MIEL), a multimodal exophora resolution framework leveraging sound source localization (SSL), semantic mapping, visual-language models (VLMs), and interactive questioning with GPT-4o. SSL is used to orient the robot toward users who are initially outside its visual field, enabling accurate identification of user gestures and pointing directions. When ambiguity remains, the robot proactively interacts with the user, employing GPT-4o to formulate clarifying questions. Experiments in a real-world environment showed results approximately 1.3 times better when the user was visible to the robot and 2.0 times better when the user was not visible, compared to methods without SSL and interactive questioning.

In daily life, we frequently use verbal instructions that include demonstratives, such as "Take that for me," but for robots the target object is often unclear and the user or object is often not in the robot's view. One of the challenges in robotics is enabling daily life support robots to understand and execute tasks based on such instructions and situations [1]. Achieving this requires exophora resolution [2], [3]: identifying the referent (a person or object) associated with an anaphor (a demonstrative or pronoun) in an utterance, based on the surrounding context of the speaker or listener.
For instance, if a user instructs the robot to "Bring me that cup," the robot must identify the target object corresponding to "that cup," even if there are many cups in the environment.
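The resolution flow described above (orient via SSL when the user is out of view, read the pointing gesture with a VLM, then fall back to a clarifying question) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: all names (`Scene`, `resolve_exophora`, `turn_toward`, `ask_clarifying_question`) and the stubbed-out sensor and dialogue steps are assumptions for exposition.

```python
# Hypothetical sketch of a MIEL-style exophora resolution flow.
# Function and field names are illustrative, not from the paper.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Scene:
    user_visible: bool            # is the user in the camera frame?
    sound_direction_deg: float    # sound source localization (SSL) estimate
    candidate_objects: list = field(default_factory=list)  # from the semantic map
    pointing_target: Optional[str] = None  # object a VLM infers from pointing, if any

def turn_toward(direction_deg: float) -> None:
    # Placeholder for a real robot rotation command.
    pass

def ask_clarifying_question(candidates: list) -> str:
    # Placeholder: a real system would generate a question with an LLM
    # (GPT-4o in the paper) and parse the user's answer; here we just
    # return the first candidate.
    return candidates[0]

def resolve_exophora(scene: Scene) -> str:
    """Resolve a 'that cup'-style reference, facing the user first."""
    # Step 1: if the user is out of view, use SSL to orient toward them.
    if not scene.user_visible:
        turn_toward(scene.sound_direction_deg)
        scene.user_visible = True  # assume the turn brings the user into view

    # Step 2: let a VLM interpret the user's gesture / pointing direction.
    if scene.pointing_target is not None:
        return scene.pointing_target

    # Step 3: if only one candidate remains, take it; otherwise ask.
    if len(scene.candidate_objects) == 1:
        return scene.candidate_objects[0]
    return ask_clarifying_question(scene.candidate_objects)
```

For example, a scene with the user behind the robot and a pointing gesture toward `cup_2` would resolve to `cup_2` after the robot turns; a scene with several cups and no gesture would trigger the clarifying-question fallback.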
Aug-25-2025
- Country:
- Asia > Japan > Honshū
- Kansai
- Kyoto Prefecture > Kyoto (0.05)
- Osaka Prefecture > Osaka (0.04)
- Kantō > Tokyo Metropolis Prefecture > Tokyo (0.04)
- Genre:
- Research Report (0.82)