R+X: Retrieval and Execution from Everyday Human Videos
Georgios Papagiannis, Norman Di Palo, Pietro Vitiello, Edward Johns
arXiv.org Artificial Intelligence
We present R+X, a framework which enables robots to learn skills from long, unlabelled, first-person videos of humans performing everyday tasks. Given a language command from a human, R+X first retrieves short video clips containing relevant behaviour, and then executes the skill by conditioning an in-context imitation learning method on this behaviour. By leveraging a Vision Language Model (VLM) for retrieval, R+X does not require any manual annotation of the videos, and by leveraging in-context learning for execution, robots can perform commanded skills immediately, without requiring a period of training on the retrieved videos. Experiments studying a range of everyday household tasks show that R+X succeeds at translating unlabelled human videos into robust robot skills, and that R+X outperforms several recent alternative methods. Videos are available at https://www.robot-learning.uk/r-plus-x.
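The abstract describes a two-stage pipeline: retrieve relevant clips from long human videos given a language command, then execute by conditioning an in-context imitation policy on those clips. A minimal sketch of that structure is below; all names and function bodies are illustrative stand-ins (the real system would query a Vision Language Model for retrieval and an in-context imitation learner for execution), not the paper's code:

```python
# Illustrative sketch of an R+X-style retrieve-then-execute pipeline.
# Every function body here is a hypothetical stand-in for the paper's
# actual components (VLM-based retrieval, in-context imitation).

from dataclasses import dataclass

@dataclass
class Clip:
    """A short segment of a long egocentric human video."""
    start_s: float
    end_s: float
    description: str  # what a VLM might say is happening in the clip

def retrieve(command: str, clips: list[Clip], k: int = 2) -> list[Clip]:
    """Stage 1 (R): rank clips by relevance to the language command.
    Relevance here is naive word overlap; R+X uses a VLM instead."""
    def score(clip: Clip) -> int:
        return len(set(command.lower().split())
                   & set(clip.description.lower().split()))
    return sorted(clips, key=score, reverse=True)[:k]

def execute(command: str, demos: list[Clip]) -> str:
    """Stage 2 (X): condition an in-context imitation policy on the
    retrieved clips. Stubbed as a string for illustration only."""
    return f"policy({command!r}) conditioned on {len(demos)} retrieved clip(s)"

# Toy "long unlabelled video" split into candidate clips.
video = [
    Clip(0.0, 4.0, "person opens the fridge"),
    Clip(10.0, 14.0, "person wipes the table with a sponge"),
    Clip(30.0, 33.0, "person closes the oven"),
]
demos = retrieve("open the fridge", video, k=1)
print(demos[0].description)           # most relevant clip
print(execute("open the fridge", demos))
```

The key property mirrored here is that no manual annotation or per-task training is needed: retrieval selects demonstrations at command time, and execution conditions on them in context.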
Jul-17-2024