No Filter: Cultural and Socioeconomic Diversity in Contrastive Vision-Language Models
Neural Information Processing Systems
We study cultural and socioeconomic diversity in contrastive vision-language models (VLMs). Using a broad range of benchmark datasets and evaluation metrics, we bring to attention several important findings. First, the common filtering of training data to English image-text pairs disadvantages communities of lower socioeconomic status and negatively impacts cultural understanding. Notably, this performance gap is not captured by, and is even at odds with, the currently popular evaluation metrics derived from the Western-centric ImageNet and COCO datasets. Second, pretraining with global, unfiltered data before fine-tuning on English content can improve cultural understanding without sacrificing performance on said popular benchmarks. Third, we introduce the task of geo-localization as a novel evaluation metric to assess cultural diversity in VLMs.
May-27-2025, 15:12:56 GMT
- Technology:
- Information Technology > Artificial Intelligence
- Natural Language (0.66)
- Vision (0.66)