Evaluating the Robustness of Open-Source Vision-Language Models to Domain Shift in Object Captioning
Tavella, Federico; Drinkwater, Amber; Cangelosi, Angelo
arXiv.org Artificial Intelligence
Vision-Language Models (VLMs) have emerged as powerful tools for generating textual descriptions from visual data. While these models excel on web-scale datasets, their robustness to the domain shifts inherent in many real-world applications remains under-explored. This paper presents a systematic evaluation of VLM performance on a single-view object captioning task when faced with a controlled, physical domain shift. We compare captioning accuracy across two distinct object sets: a collection of multi-material, real-world tools and a set of single-material, 3D-printed items. The 3D-printed set introduces a significant domain shift in texture and material properties, challenging the models' generalization capabilities. Our quantitative results demonstrate that all tested VLMs show a marked performance degradation when describing the 3D-printed objects compared to the real-world tools. This underscores a critical limitation in the ability of current models to generalize beyond surface-level features and highlights the need for more robust architectures for real-world signal processing applications.
Sep-17-2025
- Genre:
- Research Report > New Finding (0.89)
- Technology:
- Information Technology > Artificial Intelligence
- Vision (1.00)
- Machine Learning (1.00)
- Natural Language > Large Language Model (0.70)