LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning
Niu, Dantong, Sharma, Yuvan, Biamby, Giscard, Quenum, Jerome, Bai, Yutong, Shi, Baifeng, Darrell, Trevor, Herzig, Roei
arXiv.org Artificial Intelligence
Recently, instruction-tuned Large Multimodal Models (LMMs) such as InstructBLIP [1], InstructGPT [2], LLaVA [3, 4], PaLM [5], and others have demonstrated state-of-the-art performance on a variety of vision-and-language tasks. However, existing LMMs for robotics [6, 7, 8, 9] do not always demonstrate the same success and consistency across varied embodied settings. This may result from the unique challenges encountered in robotics, such as the variability of real-world environments, the differences between robots, and the need to control actions reliably. Since multimodal instruction tuning has been a key ingredient in the success of LMMs, it is natural to leverage this technique in a robotics setting as well. Here, we propose a vision-action instruction tuning method that bridges the gap between a language model's fundamental pre-training objective of next-word prediction and the goal of enabling the model to handle diverse robotics settings. In this work, we introduce our Large LAnguage model for Robotic Vision and Action (LLARVA), an open-source instruction-tuned LMM for robotic applications that generalizes efficiently across various environments and robotic configurations. Our key idea is the formulation of a novel instruction prompt that encapsulates robot type, task, scene configuration, and control regime in a natural-language prefix amenable to contemporary LMMs.
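To make the idea of such an instruction prompt concrete, the sketch below shows one possible way to compose a natural-language prefix from robot type, task, scene configuration, and control regime. The field names, template wording, and example values are illustrative assumptions, not the exact prompt format used by LLARVA.

```python
# Minimal sketch of a vision-action instruction prompt builder.
# The template wording and field names are assumptions for illustration;
# the paper's actual prompt format may differ.

def build_instruction_prompt(robot_type: str, task: str,
                             scene: str, control_regime: str) -> str:
    """Compose a natural-language prefix encoding the robot configuration."""
    return (
        f"You are controlling a {robot_type} robot using {control_regime} control. "
        f"Scene: {scene}. "
        f"Task: {task}. "
        "Predict the next action."
    )


# Example usage with hypothetical values.
prompt = build_instruction_prompt(
    robot_type="Franka Panda",
    task="pick up the red block and place it in the bin",
    scene="tabletop with assorted blocks",
    control_regime="end-effector delta",
)
print(prompt)
```

Encoding this information as a plain-text prefix keeps the input compatible with a standard LMM's next-token interface, which is what lets the same model be conditioned on different robots and control regimes without architectural changes.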
Jun-17-2024