IA-VLA: Input Augmentation for Vision-Language-Action models in settings with semantically complex tasks
Eric Hannus, Miika Malin, Tran Nguyen Le, Ville Kyrki
–arXiv.org Artificial Intelligence
Figure 1: Semantically complex language instructions, such as those involving the relative positions of objects, pose a difficult challenge for vision-language-action models (VLAs). To address this problem, we propose IA-VLA, a framework that augments the input to a VLA by offloading semantic understanding to a large vision-language model (VLM). Semantic segmentation labels image regions, which the VLM then uses to identify the masks of the task-relevant objects. These objects are highlighted in the VLA input, together with the language instruction, which can optionally be simplified.

Abstract -- Vision-language-action models (VLAs) have become an increasingly popular approach for addressing robot manipulation problems in recent years. However, such models need to output actions at a rate suitable for robot control, which limits the size of the language model they can be based on and, consequently, their language understanding capabilities. Manipulation tasks may require complex language instructions, such as identifying target objects by their relative positions, to specify human intention. We therefore introduce IA-VLA, a framework that utilizes the extensive language understanding of a large vision-language model as a pre-processing stage to generate improved context that augments the input of a VLA. We evaluate the framework on a set of semantically complex tasks that have been underexplored in the VLA literature, namely tasks involving visual duplicates, i.e., visually indistinguishable objects.
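The pipeline described in the caption and abstract (segment the image, ask a large VLM which regions the instruction refers to, then highlight those regions in the VLA input) can be sketched as follows. This is a minimal illustrative sketch only: every name here (`segment_image`, `select_relevant_masks`, `keyword_vlm`, `augment_input`) is a hypothetical placeholder, not the authors' actual API, and the toy keyword matcher merely stands in for a real VLM query.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Mask:
    label: str          # region label from semantic segmentation
    pixels: frozenset   # pixel coordinates covered by the region

def segment_image(image):
    """Stand-in for a semantic segmentation model that labels image regions."""
    # In the real framework a segmentation network would produce these masks;
    # here the toy "image" already carries its labeled regions.
    return image["regions"]

def keyword_vlm(label, instruction):
    """Toy stand-in for a large VLM: a region is relevant if its label
    appears in the instruction. A real VLM would resolve relative
    positions and other semantically complex references."""
    return label in instruction

def select_relevant_masks(masks, instruction, vlm):
    """Query the (stand-in) VLM to pick the task-relevant masks."""
    return [m for m in masks if vlm(m.label, instruction)]

def augment_input(image, instruction, vlm=keyword_vlm, simplify=None):
    """Build the augmented VLA input: highlighted masks plus the
    (optionally simplified) language instruction."""
    masks = segment_image(image)
    relevant = select_relevant_masks(masks, instruction, vlm)
    augmented_image = {"image": image, "highlighted_masks": relevant}
    text = simplify(instruction) if simplify is not None else instruction
    return augmented_image, text

# Usage with a toy scene of two labeled regions:
scene = {"regions": [Mask("red block", frozenset({(0, 0)})),
                     Mask("blue block", frozenset({(5, 5)}))]}
aug, text = augment_input(scene, "pick up the red block next to the bowl")
print([m.label for m in aug["highlighted_masks"]])  # → ['red block']
```

The design point the sketch captures is the separation of concerns: the slow, semantically capable VLM runs once as pre-processing, while the VLA keeps its fast control loop and only sees an input in which the hard language grounding has already been done.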
Sep-30-2025