Questioning the Stability of Visual Question Answering
Rosenfeld, Amir, Glazer, Neta, Fetaya, Ethan
–arXiv.org Artificial Intelligence
Visual Language Models (VLMs) have achieved remarkable progress, yet their reliability under small, meaning-preserving input changes remains poorly understood. We present the first large-scale, systematic study of VLM robustness to benign visual and textual perturbations (pixel-level shifts, light geometric transformations, padded rescaling, paraphrasing, and multilingual rewrites) that do not alter the underlying semantics of an image-question pair. Across a broad set of models and datasets, we find that modern VLMs are highly sensitive to such minor perturbations: a substantial fraction of samples change their predicted answer under at least one visual or textual modification. We characterize how this instability varies across perturbation types, question categories, and models, revealing that even state-of-the-art systems (e.g., GPT-4o, Gemini 2.0 Flash) frequently fail under shifts as small as a few pixels or harmless rephrasings. We further show that sample-level stability serves as a strong indicator of correctness: stable samples are consistently far more likely to be answered correctly. Leveraging this, we demonstrate that the stability patterns of small, accessible open-source models can be used to predict the correctness of much larger closed-source models with high precision. Our findings expose a fundamental fragility in current VLMs and highlight the need for robustness evaluations that go beyond adversarial perturbations, focusing instead on invariances that models should reliably uphold.
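As a rough illustration of what sample-level stability could mean in practice, the sketch below scores one image-question pair by the fraction of meaning-preserving perturbations that leave the predicted answer unchanged. The abstract does not give the paper's exact metric, so this definition is an assumption, and every name here (`stability_score`, `is_stable`, `toy_predict`, `shift_right`, `paraphrase`) is hypothetical; the "model" is a deliberately brittle toy, not a real VLM.

```python
from typing import Callable, Hashable, List, Sequence, Tuple

Image = List[List[int]]  # toy grayscale image: rows of pixel intensities
Perturbation = Callable[[Image, str], Tuple[Image, str]]


def stability_score(
    predict: Callable[[Image, str], Hashable],
    image: Image,
    question: str,
    perturbations: Sequence[Perturbation],
) -> float:
    """Fraction of perturbed inputs whose answer equals the unperturbed answer.

    Assumed definition for illustration; not necessarily the paper's metric.
    """
    if not perturbations:
        raise ValueError("need at least one perturbation")
    baseline = predict(image, question)
    unchanged = sum(
        predict(*perturb(image, question)) == baseline for perturb in perturbations
    )
    return unchanged / len(perturbations)


def is_stable(predict, image, question, perturbations) -> bool:
    """A sample counts as stable only if no perturbation changes the answer."""
    return stability_score(predict, image, question, perturbations) == 1.0


def toy_predict(image: Image, question: str) -> str:
    """Brittle stand-in model: looks only at the top-left pixel."""
    return "bright" if image[0][0] > 127 else "dark"


def shift_right(image: Image, question: str) -> Tuple[Image, str]:
    """Visual perturbation: cyclically shift every row by one pixel."""
    return [row[-1:] + row[:-1] for row in image], question


def paraphrase(image: Image, question: str) -> Tuple[Image, str]:
    """Textual perturbation: reword the question without changing its meaning."""
    return image, "Could you tell me: " + question


if __name__ == "__main__":
    sample = [[200, 10], [10, 10]]
    score = stability_score(
        toy_predict, sample, "Is it bright?", [shift_right, paraphrase]
    )
    print(f"stability score: {score}")
```

Under this assumed definition, a one-pixel shift flips the toy model's answer while the paraphrase does not, so the sample is flagged as unstable. A per-sample flag of this kind is the sort of signal the abstract describes as indicative of correctness.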
Nov-17-2025
- Country:
- Asia > Singapore (0.04)
- Europe > Austria (0.04)
- Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
- Europe > Netherlands > North Holland > Amsterdam (0.04)
- North America > Canada > Alberta > Census Division No. 15 > Improvement District No. 9 > Banff (0.04)
- North America > Canada > British Columbia > Vancouver (0.04)
- North America > United States (0.04)
- Genre:
- Research Report (0.70)
- Technology: