Multilingual Diversity Improves Vision-Language Representations
–Neural Information Processing Systems
Massive web-crawled image-text datasets lay the foundation for recent progress in multimodal learning. These datasets are designed with the goal of training a model to do well on standard computer vision benchmarks, many of which, however, have been shown to be English-centric (e.g., ImageNet). Consequently, existing data curation techniques gravitate towards using predominantly English image-text pairs and discard many potentially useful non-English samples. Multilingual data is inherently enriching not only because it provides a gateway to learn about culturally salient concepts, but also because it depicts common concepts differently from monolingual data. We thus conduct a systematic study to explore the performance benefits of using more samples of non-English origins with respect to English vision tasks.
Neural Information Processing Systems
Aug-13-2025, 06:03:26 GMT
- Country:
- Africa (0.07)
- Technology:
- Information Technology > Artificial Intelligence
- Machine Learning (0.78)
- Vision (0.60)
- Information Technology > Artificial Intelligence