WikiDO: A New Benchmark Evaluating Cross-Modal Retrieval for Vision-Language Models

May-27-2025, 22:05:46 GMT–Neural Information Processing Systems

Cross-modal (image-to-text and text-to-image) retrieval is an established task used in evaluation benchmarks to test the performance of vision-language models (VLMs). Several VLMs (e.g., CLIP, BLIP-2) have achieved near-perfect performance on widely used image-text retrieval benchmarks such as MSCOCO-Test-5K and Flickr30K-Test-1K. As a measure of out-of-distribution (OOD) generalization, prior works rely on zero-shot performance evaluated on one dataset (Flickr) using a VLM finetuned on another (MSCOCO). We argue that such comparisons are insufficient to assess the OOD generalization capability of models because the evaluation and finetuning datasets are highly similar, both visually and linguistically. To address this gap, we introduce WikiDO (drawn from Wikipedia Diversity Observatory), a novel cross-modal retrieval benchmark to assess the OOD generalization capabilities of pretrained VLMs.
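For readers unfamiliar with how cross-modal retrieval is typically scored, the sketch below computes Recall@K in both directions from an image-by-text similarity matrix. The abstract does not name its metric, so Recall@K is an assumption here, as is the one-to-one pairing in which image i matches text i. The function names and the toy similarity values are hypothetical and do not come from any model or from WikiDO.

```python
from typing import List, Sequence


def recall_at_k(similarity: Sequence[Sequence[float]], k: int) -> float:
    """Fraction of queries whose paired item (same index) ranks in the top k.

    Assumes row i is a query and column i is its single correct match.
    """
    hits = 0
    for query_idx, scores in enumerate(similarity):
        # Rank candidate indices by descending similarity to this query.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if query_idx in ranked[:k]:
            hits += 1
    return hits / len(similarity)


def transpose(matrix: Sequence[Sequence[float]]) -> List[List[float]]:
    """Swap rows and columns so texts become the queries."""
    return [list(column) for column in zip(*matrix)]


# Toy image-by-text similarity matrix (made-up numbers for illustration only).
toy_similarity = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.7, 0.6],
]

# Image-to-text retrieval: each image (row) queries all texts (columns).
image_to_text_r1 = recall_at_k(toy_similarity, k=1)

# Text-to-image retrieval: each text queries all images.
text_to_image_r1 = recall_at_k(transpose(toy_similarity), k=1)
```

In a zero-shot OOD setting of the kind described above, the similarity matrix would come from a model finetuned on one dataset and scored on another; only the source of the scores changes, not the metric.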

benchmark, cross-modal retrieval, vision-language model

Technology:
- Information Technology
  - Communications > Social Media (0.65)
  - Artificial Intelligence
    - Natural Language (0.98)
    - Vision (0.63)