ImageInThat: Manipulating Images to Convey User Instructions to Robots
Karthik Mahadevan, Blaine Lewis, Jiannan Li, Bilge Mutlu, Anthony Tang, Tovi Grossman
Foundation models are rapidly improving robots' ability to perform everyday tasks autonomously, such as meal preparation, yet robots will still need to be instructed by humans due to limits in model performance, the difficulty of capturing user preferences, and the need for user agency. Robots can be instructed through various methods: natural language conveys immediate instructions but can be abstract or ambiguous, whereas end-user programming supports longer-horizon tasks but its interfaces struggle to capture user intent. In this work, we propose the direct manipulation of images as an alternative paradigm for instructing robots, and introduce a specific instantiation called ImageInThat, which lets users directly manipulate images in a timeline-style interface to generate robot instructions. Through a user study, we demonstrate the efficacy of ImageInThat for instructing robots in kitchen manipulation tasks, comparing it to a text-based natural language instruction method. The results show that participants were faster with ImageInThat and preferred it over the text-based method. Supplementary material, including code, can be found at: https://image-in-that.github.io/.

Advances in foundation models are rapidly improving the capabilities of autonomous robots, bringing us closer to robots entering our homes, where they can complete everyday tasks. However, the need for human instruction will persist, whether due to limitations in robot policies, models trained on internet-scale data that may not capture the specifics of users' environments or preferences, or simply users' desire to maintain control over their robots' actions. For instance, a robot asked to wash dishes might follow a standard cleaning routine (e.g., placing everything in the dishwasher and then putting it away in the cupboard) but may not respect a user's preferences (e.g., washing delicate glasses "by hand" or organizing cleaned dishes in a specific way), thus necessitating human intervention. We introduce a new paradigm for instructing robots through the direct manipulation of images. ImageInThat is a specific instantiation of this paradigm in which users manipulate images in a timeline-style interface to create instructions for the robot to execute. Existing methods for instructing robots range from those that command the robot for immediate execution (e.g., uttering a language instruction to wash glasses by hand [1]) to those that program the robot, such as learning from demonstration [2] or end-user robot programming [3]. However, prior methods, whether used for commanding or programming, have notable drawbacks.
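To make the image-manipulation paradigm concrete, the sketch below shows one hypothetical way a timeline of user-manipulated image keyframes could be diffed into pick-and-place style robot instructions. This is a minimal illustration under assumed representations, not the authors' implementation; the Keyframe, ObjectState, and keyframes_to_instructions names and the 2D-position encoding are assumptions made for this example.

```python
# Hypothetical sketch (not the paper's implementation): derive simple move
# instructions by comparing consecutive keyframes in a manipulated-image timeline.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ObjectState:
    """Pose of one object in a single timeline keyframe (assumed 2D position)."""
    name: str
    position: Tuple[float, float]


@dataclass
class Keyframe:
    """One image in the timeline after the user's direct manipulation."""
    objects: List[ObjectState]


def keyframes_to_instructions(timeline: List[Keyframe]) -> List[str]:
    """Emit a move instruction for each object whose position changed
    between consecutive keyframes (i.e., the user dragged it in the image)."""
    instructions = []
    for before, after in zip(timeline, timeline[1:]):
        before_pos = {o.name: o.position for o in before.objects}
        for obj in after.objects:
            if obj.name in before_pos and before_pos[obj.name] != obj.position:
                instructions.append(
                    f"move {obj.name} from {before_pos[obj.name]} to {obj.position}"
                )
    return instructions


if __name__ == "__main__":
    # Example: the user drags the mug toward the sink across two keyframes.
    timeline = [
        Keyframe([ObjectState("mug", (0.2, 0.5)), ObjectState("plate", (0.6, 0.4))]),
        Keyframe([ObjectState("mug", (0.8, 0.3)), ObjectState("plate", (0.6, 0.4))]),
    ]
    for step in keyframes_to_instructions(timeline):
        print(step)
```

In practice, the paper's interface operates on images rather than explicit object lists, so a real system would first recover object identities and poses from each manipulated image before any such diffing step.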
arXiv.org Artificial Intelligence
Jan-20-2025
- Country:
- North America > United States > Wisconsin (0.14)
- Genre:
- Research Report
- Experimental Study (1.00)
- New Finding (0.87)
- Workflow (1.00)
- Industry:
- Government > Regional Government (0.46)
- Technology: