Verbal Instruction


Reviews: Tagger: Deep Unsupervised Perceptual Grouping

Neural Information Processing Systems

UPDATE: I thank the authors for their convincing rebuttal; in view of the promised updates to the technical specifications and the description of the method, I have increased the scores for "Technical quality" and "Clarity and presentation". The only major concern I still have is the lack of a suitable baseline to compare with. In particular, I do not agree that a comparison to [1] is impossible without their code. Instead, I'd encourage the authors to compare their method on the multi-MNIST benchmark described in Figure 1 of [1] (and simply to use the numbers provided by [1] for comparison, without re-simulation). This would significantly strengthen the results.

Unfortunately, however, I see two major flaws with the current presentation of the material.

** Literature and comparison to competitors

First, the literature on this topic does not seem to be suitably accounted for.


Words2Contact: Identifying Support Contacts from Verbal Instructions Using Foundation Models

Totsila, Dionis, Rouxel, Quentin, Mouret, Jean-Baptiste, Ivaldi, Serena

arXiv.org Artificial Intelligence

This paper presents Words2Contact, a language-guided multi-contact placement pipeline leveraging large language models and vision language models. Our method is a key component for language-assisted teleoperation and human-robot cooperation, where human operators can instruct the robots where to place their support contacts before whole-body reaching or manipulation using natural language. Words2Contact transforms the verbal instructions of a human operator into contact placement predictions; it also handles iterative corrections until the human is satisfied with the contact location identified in the robot's field of view. We benchmark state-of-the-art LLMs and VLMs for size and performance in contact prediction. We demonstrate the effectiveness of the iterative correction process, showing that users, even naive ones, quickly learn how to instruct the system to obtain accurate locations. Finally, we validate Words2Contact in real-world experiments with the Talos humanoid robot, instructed by human operators to place support contacts on different locations and surfaces to avoid falling when reaching for distant objects.
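The iterative correction process described above can be illustrated with a minimal sketch. This is not the authors' code: the function name, the pixel step size, and the simplified correction vocabulary ("left", "right", "up", "down") are all hypothetical stand-ins for the LLM-parsed corrections in the actual pipeline.

```python
def refine_contact(initial, corrections, step=20):
    """Apply a sequence of verbal corrections to an (x, y) contact
    prediction in image coordinates. Each correction shifts the
    predicted contact point by `step` pixels; in the real system the
    corrections would be free-form language parsed by an LLM."""
    offsets = {"left": (-step, 0), "right": (step, 0),
               "up": (0, -step), "down": (0, step)}
    x, y = initial
    for word in corrections:
        dx, dy = offsets[word]
        x, y = x + dx, y + dy
    return (x, y)
```

For example, an initial prediction at (100, 100) corrected with "left" then "up" ends at (80, 80); the loop would repeat until the operator confirms the location.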


Interactive Task Encoding System for Learning-from-Observation

Wake, Naoki, Kanehira, Atsushi, Sasabuchi, Kazuhiro, Takamatsu, Jun, Ikeuchi, Katsushi

arXiv.org Artificial Intelligence

We present the Interactive Task Encoding System (ITES) for teaching robots to perform manipulative tasks. ITES is designed as an input system for the Learning-from-Observation (LfO) framework, which enables household robots to be programmed using few-shot human demonstrations without the need for coding. In contrast to previous LfO systems that rely solely on visual demonstrations, ITES leverages both verbal instructions and interaction to enhance recognition robustness, thus enabling multimodal LfO. ITES identifies tasks from verbal instructions and extracts parameters from visual demonstrations. Meanwhile, the recognition result is reviewed by the user for interactive correction. Evaluations conducted on a real robot demonstrate the successful teaching of multiple operations for several scenarios, suggesting the usefulness of ITES for multimodal LfO. The source code is available at https://github.com/microsoft/symbolic-robot-teaching-interface.
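The split the abstract describes, with task identification from the verbal channel, parameters from the visual channel, and a user review step, can be sketched as follows. The task keywords, dictionary format, and function names here are hypothetical illustrations, not ITES's actual interface.

```python
def encode_task(verbal, demo_params, confirm):
    """Pair a task identified from a verbal instruction with parameters
    extracted from the visual demonstration, then let the user review
    the result. `demo_params` stands in for vision-derived parameters;
    `confirm` is a callback returning True if the user accepts."""
    task_keywords = {"pick": "PICK", "place": "PLACE", "open": "OPEN"}
    task = next((t for kw, t in task_keywords.items()
                 if kw in verbal.lower()), None)
    result = {"task": task, "params": demo_params}
    # Interactive correction: a rejected recognition triggers re-teaching.
    return result if confirm(result) else None
```

A rejected result (confirm returning False) would send the system back to the demonstration phase, which is the interactive-correction loop the abstract refers to.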


An artificial neural network to acquire grounded representations of robot actions and language

#artificialintelligence

To best assist human users while they complete everyday tasks, robots should be able to understand their queries, answer them and perform actions accordingly. In other words, they should be able to flexibly generate and perform actions that are aligned with a user's verbal instructions. To understand a user's instructions and act accordingly, robotic systems should be able to make associations between linguistic expressions, actions and environments. Deep neural networks have proved to be particularly good at acquiring representations of linguistic expressions, yet they typically need to be trained on large datasets including robot actions, linguistic descriptions and information about different environments. Researchers at Waseda University in Tokyo recently developed a deep neural network that can acquire grounded representations of robot actions and linguistic descriptions of these actions.


Language Bootstrapping: Learning Word Meanings From Perception-Action Association

Salvi, Giampiero, Montesano, Luis, Bernardino, Alexandre, Santos-Victor, José

arXiv.org Machine Learning

We address the problem of bootstrapping language acquisition for an artificial system similarly to what is observed in experiments with human infants. Our method works by associating meanings to words in manipulation tasks, as a robot interacts with objects and listens to verbal descriptions of the interactions. The model is based on an affordance network, i.e., a mapping between robot actions, robot perceptions, and the perceived effects of these actions upon objects. We extend the affordance model to incorporate spoken words, which allows us to ground the verbal symbols to the execution of actions and the perception of the environment. The model takes verbal descriptions of a task as the input and uses temporal co-occurrence to create links between speech utterances and the involved objects, actions, and effects. We show that the robot is able to form useful word-to-meaning associations, even without considering grammatical structure in the learning process and in the presence of recognition errors. These word-to-meaning associations are embedded in the robot's own understanding of its actions. Thus, they can be directly used to instruct the robot to perform tasks and also make it possible to incorporate context in the speech recognition task. We believe that the encouraging results with our approach may afford robots the capacity to acquire language descriptors in their operating environment, as well as shed some light on how this challenging process develops with human infants.
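The temporal co-occurrence idea in the abstract can be illustrated with a minimal counting sketch: words heard in the same time window as perceived symbols accumulate joint counts, and each word is linked to its most frequent co-occurring symbol. This is a deliberately simplified stand-in, not the affordance-network model the paper actually uses, and the data format is hypothetical.

```python
from collections import Counter, defaultdict

def cooccurrence(episodes):
    """episodes: list of (utterance_words, perceived_symbols) pairs,
    where the symbols are the objects, actions, and effects observed in
    the same time window as the utterance. Returns a word -> symbol map
    based on raw co-occurrence counts."""
    counts = defaultdict(Counter)
    for words, symbols in episodes:
        for w in words:
            counts[w].update(symbols)
    # Link each word to its most frequently co-occurring symbol.
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}
```

With enough episodes, a word like "grasps" co-occurs with the grasp action across different objects, so the object symbols wash out and the word is linked to the action, even without any grammatical analysis.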


An Interactive Approach for Situated Task Teaching through Verbal Instructions

Mericli, Cetin (Carnegie Mellon University) | Klee, Steven D. (Carnegie Mellon University) | Paparian, Jack (Carnegie Mellon University) | Veloso, Manuela (Carnegie Mellon University)

AAAI Conferences

The ability to specify a task without having to write special software is an important and prominent feature for a mobile service robot deployed in a crowded office environment, working around and interacting with people. In this paper, we contribute an interactive approach for enabling users to teach tasks to a mobile service robot through verbal commands. The input is given as typed or spoken instructions, which are then mapped to the available sensing and actuation primitives on the robot. The main contributions of this work are the addition of conditionals on sensory information that allow the specified actions to be executed in a closed-loop manner, and a correction mode that allows an existing task to be modified or corrected at a later time by providing a replacement action during the test execution. We describe all the components of the system along with the implementation details and illustrative examples in depth. We also discuss the extensibility of the presented system, and point out potential future extensions.