IA-VLA: Input Augmentation for Vision-Language-Action models in settings with semantically complex tasks

Hannus, Eric, Malin, Miika, Le, Tran Nguyen, Kyrki, Ville

arXiv.org Artificial Intelligence 

Figure 1: Semantically complex language instructions, such as those identifying objects by their relative positions, pose a significant challenge for vision-language-action models (VLAs). To address this problem, we propose IA-VLA, a framework that augments the input to VLAs by offloading semantic understanding to a larger vision-language model (VLM). We use semantic segmentation to label image regions, which the VLM then uses to identify the masks of the task-relevant objects. These objects are highlighted in the VLA input, together with the language instruction, which can optionally be simplified.

Abstract: Vision-language-action models (VLAs) have become an increasingly popular approach to robot manipulation in recent years. However, such models must output actions at a rate suitable for robot control, which limits the size of the language model they can be based on and, consequently, their language understanding capabilities. Manipulation tasks may require semantically complex language instructions, such as identifying target objects by their relative positions, to specify human intention. We therefore introduce IA-VLA, a framework that utilizes the extensive language understanding of a large vision-language model as a pre-processing stage to generate improved context that augments the input of a VLA. We evaluate the framework on a set of semantically complex tasks that have been underexplored in the VLA literature, namely tasks involving visual duplicates, i.e., visually indistinguishable objects.
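The caption above outlines a three-stage pipeline: segment and label image regions, ask a large VLM which labeled regions the instruction refers to, then highlight those regions in the image passed to the VLA. The following is a minimal sketch of that flow, assuming numpy image arrays; the segmentation, VLM, and VLA calls are hypothetical placeholders, since the paper does not specify particular implementations in this excerpt, and only the mask-highlighting step is implemented concretely.

```python
# Minimal sketch of the IA-VLA input-augmentation pipeline described above.
# All model calls are hypothetical stand-ins; the highlight color and blend
# weight are illustrative assumptions, not values taken from the paper.
from dataclasses import dataclass
import numpy as np


@dataclass
class LabeledRegion:
    label_id: int      # numeric tag drawn onto the image for the VLM to read
    mask: np.ndarray   # boolean (H, W) mask from semantic segmentation


def segment_and_label(image: np.ndarray) -> tuple[np.ndarray, list[LabeledRegion]]:
    """Hypothetical stand-in for a semantic segmentation stage that proposes
    object masks and returns a copy of the image with visible numeric labels."""
    raise NotImplementedError("plug in a segmentation model here")


def query_vlm(labeled_image: np.ndarray, instruction: str) -> tuple[set[int], str]:
    """Hypothetical call to a large VLM: given the label-annotated image and a
    semantically complex instruction, return the label ids of the task-relevant
    regions and an optionally simplified instruction (empty string if unused)."""
    raise NotImplementedError("plug in a VLM here")


def highlight(image: np.ndarray, regions: list[LabeledRegion],
              relevant_ids: set[int]) -> np.ndarray:
    """Overlay a highlight color on the task-relevant masks so the downstream
    VLA can locate the intended objects without resolving the language itself."""
    out = image.copy()
    color = np.array([255, 0, 0], dtype=np.uint8)  # assumed highlight color
    for region in regions:
        if region.label_id in relevant_ids:
            blended = 0.5 * out[region.mask] + 0.5 * color
            out[region.mask] = blended.astype(np.uint8)
    return out


def augment_vla_input(image: np.ndarray, instruction: str) -> tuple[np.ndarray, str]:
    """Produce the augmented (image, instruction) pair fed to the VLA."""
    labeled_image, regions = segment_and_label(image)
    relevant_ids, simplified = query_vlm(labeled_image, instruction)
    return highlight(image, regions, relevant_ids), simplified or instruction
```

The key design point this sketch illustrates is the division of labor: the slow, semantically capable VLM runs once as pre-processing, while the VLA consumes only the highlighted image and (possibly simplified) instruction at control rate.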
