Multilingual Diversity Improves Vision-Language Representations

Aug-13-2025, 06:03:26 GMT–Neural Information Processing Systems

Massive web-crawled image-text datasets lay the foundation for recent progress in multimodal learning. These datasets are designed with the goal of training a model to do well on standard computer vision benchmarks, many of which, however, have been shown to be English-centric (e.g., ImageNet). Consequently, existing data curation techniques gravitate towards using predominantly English image-text pairs and discard many potentially useful non-English samples. Multilingual data is inherently enriching not only because it provides a gateway to learn about culturally salient concepts, but also because it depicts common concepts differently from monolingual data. We thus conduct a systematic study to explore the performance benefits of using more samples of non-English origins with respect to English vision tasks.

artificial intelligence, machine learning, vision-language representation, (5 more...)

Neural Information Processing Systems

Aug-13-2025, 06:03:26 GMT

Conferences Web Page

Add feedback

Country:
- Africa (0.07)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning (0.78)
  - Vision (0.60)