Visual Data-Type Understanding does not emerge from Scaling Vision-Language Models

Vishaal Udandarao, Max F. Burg, Samuel Albanie, Matthias Bethge

Dec-6-2023–arXiv.org Artificial Intelligence

Recent advances in the development of vision-language models (VLMs) are yielding remarkable success in recognizing visual semantic content, including impressive instances of compositional image understanding. Here, we introduce the novel task of Visual Data-Type Identification, a basic perceptual skill with implications for data curation (e.g., noisy data-removal from large datasets, domain-specific retrieval) and autonomous vision (e.g., distinguishing changing weather conditions from camera lens staining). We develop two datasets consisting of animal images altered across a diverse set of 27 visual data-types, spanning four broad categories. An extensive zero-shot evaluation of 39 VLMs, ranging from 100M to 80B parameters, shows a nuanced performance landscape. While VLMs are reasonably good at identifying certain stylistic data-types, such as cartoons and sketches, they struggle with simpler data-types arising from basic manipulations like image rotations or additive noise. Our findings reveal that (i) model scaling alone yields marginal gains for contrastively-trained models like CLIP, and (ii) there is a pronounced drop in performance for the largest auto-regressively trained VLMs like OpenFlamingo. This finding points to a blind spot in current frontier VLMs: they excel in recognizing semantic content but fail to acquire an understanding of visual data-types through scaling. By analyzing the pre-training distributions of these models and incorporating data-type information into the captions during fine-tuning, we achieve a significant enhancement in performance. By exploring this previously uncharted task, we aim to set the stage for further advancing VLMs to equip them with visual data-type understanding. Code and datasets are released at https://github.com/bethgelab/DataTypeIdentification.
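
To make the zero-shot setup concrete, here is a minimal sketch of how a contrastively trained model could be asked to pick a data-type: the image embedding is compared against one text-prompt embedding per candidate data-type, and the closest prompt wins. This is an illustration under stated assumptions, not the authors' evaluation code. The function names, the three data-type labels, and the toy 3-dimensional embeddings are all hypothetical stand-ins for the outputs of a real image and text encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def identify_data_type(image_embedding, prompt_embeddings):
    """Return the data-type whose prompt embedding is closest to the image.

    prompt_embeddings maps a data-type name to the embedding of a text
    prompt describing that data-type (hypothetical; a real pipeline would
    obtain these from the model's text encoder).
    """
    scores = {
        name: cosine(image_embedding, embedding)
        for name, embedding in prompt_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy embeddings (assumed values, for illustration only).
prompt_embeddings = {
    "cartoon": [1.0, 0.0, 0.0],
    "left-rotated": [0.0, 1.0, 0.0],
    "gaussian noise": [0.0, 0.0, 1.0],
}
image_embedding = [0.9, 0.1, 0.2]

predicted, scores = identify_data_type(image_embedding, prompt_embeddings)
print(predicted)
```

Accuracy on such a benchmark would then be the fraction of images whose predicted data-type matches the manipulation actually applied to them.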

Country:
- Europe
  - United Kingdom > England
    - Cambridgeshire > Cambridge (0.14)
  - Switzerland > Zürich
    - Zürich (0.14)
  - Germany
    - Lower Saxony > Gottingen (0.04)
    - Baden-Württemberg > Tübingen Region
      - Tübingen (0.04)
- Asia > Middle East
  - Jordan (0.04)

Genre:
- Research Report > New Finding (0.88)

Industry:
- Automobiles & Trucks (0.67)
- Transportation > Ground
  - Road (0.67)

Technology:
- Information Technology
  - Databases (1.00)
  - Artificial Intelligence
    - Vision > Image Understanding (0.88)
    - Natural Language
      - Large Language Model (1.00)
      - Text Processing (0.93)
    - Machine Learning
      - Statistical Learning (1.00)
      - Neural Networks > Deep Learning (1.00)
      - Performance Analysis > Accuracy (0.92)