On the Limitations of Vision-Language Models in Understanding Image Transforms
Ahmad Mustafa Anis, Hasnain Ali, Saquib Sarfraz
Vision Language Models (VLMs) have demonstrated significant potential in various downstream tasks, including image/video generation, visual question answering, multimodal chatbots, and video understanding. However, these models often struggle with basic image transformations. This paper investigates the image-level understanding of VLMs, specifically OpenAI's CLIP and Google's SigLIP. Our findings reveal that these models fail to comprehend multiple common image-level augmentations. To facilitate this study, we created an augmented version of the Flickr8k dataset, pairing each image with a detailed description of the applied transformation. We further explore how this deficiency impacts downstream tasks, particularly image editing, and evaluate the performance of state-of-the-art Image2Image models on simple transformations.
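The kind of probe the abstract describes can be illustrated with a short script. The following is a minimal sketch, assuming a publicly available HuggingFace CLIP checkpoint (`openai/clip-vit-base-patch32`), a single torchvision rotation, and illustrative prompt wordings; it is not the authors' exact dataset construction or evaluation protocol.

```python
# Hedged sketch: can CLIP tell which transform was applied to an image?
# Checkpoint, transform choice, and prompt wording are illustrative
# assumptions, not the paper's actual augmented-Flickr8k setup.
import torch
from PIL import Image
from torchvision.transforms import functional as TF
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Candidate transformation descriptions (assumed wording).
prompts = [
    "a photo rotated by 90 degrees",
    "a horizontally flipped photo",
    "a blurred photo",
    "an unmodified photo",
]

image = Image.open("example.jpg").convert("RGB")  # any Flickr8k-style image
rotated = TF.rotate(image, 90)  # apply one known transform

inputs = processor(text=prompts, images=rotated,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # shape (1, num_prompts)
probs = logits.softmax(dim=-1)

for prompt, score in zip(prompts, probs[0].tolist()):
    print(f"{score:.3f}  {prompt}")
```

If the model encoded the rotation, the matching description should receive the highest score; the paper's finding is that CLIP-style encoders frequently fail such probes.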
arXiv.org Artificial Intelligence
Mar-13-2025
- Genre:
- Research Report > New Finding (0.34)
- Industry:
- Media (0.35)
- Technology:
- Information Technology > Artificial Intelligence
- Machine Learning > Neural Networks
- Deep Learning > Generative AI (0.34)
- Natural Language > Chatbot (0.54)
- Vision (1.00)