LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning
Niu, Dantong, Sharma, Yuvan, Biamby, Giscard, Quenum, Jerome, Bai, Yutong, Shi, Baifeng, Darrell, Trevor, Herzig, Roei
arXiv.org Artificial Intelligence
Recently, instruction-tuned Large Multimodal Models (LMMs) such as InstructBLIP [1], InstructGPT [2], LLaVA [3, 4], PaLM [5], and others have demonstrated state-of-the-art performance on a variety of vision-and-language tasks. However, existing LMMs for robotics [6, 7, 8, 9] do not always demonstrate the same success and consistency across varied embodied settings. This may result from the unique challenges encountered in robotics, such as the variability of real-world environments, the differences between robots, and the need to control actions reliably. Since multimodal instruction tuning has been a key ingredient in the success of LMMs, it is natural to leverage this technique in a robotics setting as well. Here, we propose a vision-action instruction tuning method that bridges the gap between a language model's fundamental pre-training objective of next-word prediction and the goal of enabling the model to handle diverse robotics settings. In this work, we introduce our Large LAnguage model for Robotic Vision and Action (LLARVA), an open-source instruction-tuned LMM for robotic applications that generalizes efficiently across various environments and robotic configurations. Our key idea is the formulation of a novel instruction prompt that encapsulates robot type, task, scene configuration, and control regime in a natural-language prefix amenable to contemporary LMMs.
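To make the idea of such an instruction prompt concrete, the sketch below shows one possible way to compose a natural-language prefix from robot type, task, scene configuration, and control regime. The field names, template wording, and example values are illustrative assumptions, not the exact prompt format used by LLARVA.

```python
# Minimal sketch of a vision-action instruction prompt builder.
# The template wording and field names are assumptions for illustration;
# the paper's actual prompt format may differ.

def build_instruction_prompt(robot_type: str, task: str,
                             scene: str, control_regime: str) -> str:
    """Compose a natural-language prefix encoding the robot configuration."""
    return (
        f"You are controlling a {robot_type} robot using {control_regime} control. "
        f"Scene: {scene}. "
        f"Task: {task}. "
        "Predict the next action."
    )


# Example usage with hypothetical values.
prompt = build_instruction_prompt(
    robot_type="Franka Panda",
    task="pick up the red block and place it in the bin",
    scene="tabletop with assorted blocks",
    control_regime="end-effector delta",
)
print(prompt)
```

Encoding this information as a plain-text prefix keeps the input compatible with a standard LMM's next-token interface, which is what lets the same model be conditioned on different robots and control regimes without architectural changes.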
Jun-17-2024