Bridge the Modality and Capability Gaps in Vision-Language Model Selection

Chao Yi, Yu-Hang He, De-Chuan Zhan, Han-Jia Ye

Neural Information Processing Systems

Vision Language Models (VLMs) excel in zero-shot image classification by pairing images with textual category names. The expanding variety of Pre-Trained VLMs increases the likelihood of finding a suitable VLM for a specific task. To better reuse the VLM resource and fully leverage its potential on different zero-shot image classification tasks, a promising strategy is to select appropriate Pre-Trained VLMs from the VLM Zoo, relying solely on the text data of the target dataset without access to the dataset's images. In this paper, we analyze two inherent challenges in assessing the ability of a VLM in this Language-Only VLM selection: the "Modality Gap", the disparity between a VLM's embeddings of the two modalities, which makes text a less reliable substitute for images; and the "Capability Gap", the discrepancy between a VLM's overall ranking and its ranking on the target dataset, which hinders directly predicting a model's dataset-specific performance from its general performance.
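As a minimal sketch (not the paper's own method), the "Modality Gap" can be observed in a CLIP-style VLM by measuring the distance between the centroids of L2-normalized image and text embeddings in the shared space; a large centroid distance indicates that text embeddings are an imperfect stand-in for image embeddings. The snippet below assumes the Hugging Face `transformers` CLIP API and uses placeholder image paths and captions chosen for illustration.

```python
# Sketch: quantify the modality gap of a CLIP-style VLM as the distance
# between the image-embedding centroid and the text-embedding centroid.
# The image paths and captions below are illustrative placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

texts = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
images = [Image.open(p) for p in ["dog.jpg", "cat.jpg", "car.jpg"]]  # placeholders

with torch.no_grad():
    text_inputs = processor(text=texts, return_tensors="pt", padding=True)
    image_inputs = processor(images=images, return_tensors="pt")
    text_emb = model.get_text_features(**text_inputs)
    image_emb = model.get_image_features(**image_inputs)

# L2-normalize so both modalities lie on the unit hypersphere.
text_emb = torch.nn.functional.normalize(text_emb, dim=-1)
image_emb = torch.nn.functional.normalize(image_emb, dim=-1)

# Modality gap: Euclidean distance between the two modality centroids.
gap = (text_emb.mean(dim=0) - image_emb.mean(dim=0)).norm().item()
print(f"Modality gap (centroid distance): {gap:.3f}")
```

The centroid-distance measure is one common proxy for the gap; the paper's selection procedure may use a different quantification.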