IA-VLA: Input Augmentation for Vision-Language-Action models in settings with semantically complex tasks

Hannus, Eric, Malin, Miika, Le, Tran Nguyen, Kyrki, Ville

arXiv.org Artificial Intelligence 

Figure 1: Semantically complex language instructions, such as those identifying objects by their relative positions, pose a significant challenge for vision-language-action models (VLAs). To address this problem, we propose IA-VLA, a framework that augments the input to VLAs by offloading semantic understanding to a larger vision-language model (VLM). We use semantic segmentation to label image regions, which the VLM then uses to identify the masks of the task-relevant objects. These objects are highlighted in the VLA input, together with the language instruction, which can optionally be simplified.

Abstract: Vision-language-action models (VLAs) have become an increasingly popular approach to robot manipulation in recent years. However, such models must output actions at a rate suitable for robot control, which limits the size of the language model they can be based on and, consequently, their language understanding capabilities. Manipulation tasks may require semantically complex language instructions, such as identifying target objects by their relative positions, to specify human intention. We therefore introduce IA-VLA, a framework that utilizes the extensive language understanding of a large vision-language model as a pre-processing stage to generate improved context that augments the input of a VLA. We evaluate the framework on a set of semantically complex tasks that have been underexplored in the VLA literature, namely tasks involving visual duplicates, i.e., visually indistinguishable objects.
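The caption above outlines a three-stage pipeline: segment and label image regions, ask a large VLM which labeled regions the instruction refers to, then highlight those regions in the image passed to the VLA. The following is a minimal sketch of that flow, assuming numpy image arrays; the segmentation, VLM, and VLA calls are hypothetical placeholders, since the paper does not specify particular implementations in this excerpt, and only the mask-highlighting step is implemented concretely.

```python
# Minimal sketch of the IA-VLA input-augmentation pipeline described above.
# All model calls are hypothetical stand-ins; the highlight color and blend
# weight are illustrative assumptions, not values taken from the paper.
from dataclasses import dataclass
import numpy as np


@dataclass
class LabeledRegion:
    label_id: int      # numeric tag drawn onto the image for the VLM to read
    mask: np.ndarray   # boolean (H, W) mask from semantic segmentation


def segment_and_label(image: np.ndarray) -> tuple[np.ndarray, list[LabeledRegion]]:
    """Hypothetical stand-in for a semantic segmentation stage that proposes
    object masks and returns a copy of the image with visible numeric labels."""
    raise NotImplementedError("plug in a segmentation model here")


def query_vlm(labeled_image: np.ndarray, instruction: str) -> tuple[set[int], str]:
    """Hypothetical call to a large VLM: given the label-annotated image and a
    semantically complex instruction, return the label ids of the task-relevant
    regions and an optionally simplified instruction (empty string if unused)."""
    raise NotImplementedError("plug in a VLM here")


def highlight(image: np.ndarray, regions: list[LabeledRegion],
              relevant_ids: set[int]) -> np.ndarray:
    """Overlay a highlight color on the task-relevant masks so the downstream
    VLA can locate the intended objects without resolving the language itself."""
    out = image.copy()
    color = np.array([255, 0, 0], dtype=np.uint8)  # assumed highlight color
    for region in regions:
        if region.label_id in relevant_ids:
            blended = 0.5 * out[region.mask] + 0.5 * color
            out[region.mask] = blended.astype(np.uint8)
    return out


def augment_vla_input(image: np.ndarray, instruction: str) -> tuple[np.ndarray, str]:
    """Produce the augmented (image, instruction) pair fed to the VLA."""
    labeled_image, regions = segment_and_label(image)
    relevant_ids, simplified = query_vlm(labeled_image, instruction)
    return highlight(image, regions, relevant_ids), simplified or instruction
```

The key design point this sketch illustrates is the division of labor: the slow, semantically capable VLM runs once as pre-processing, while the VLA consumes only the highlighted image and (possibly simplified) instruction at control rate.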
