Multi-Lingual Acquisition on Multimodal Pre-training for Cross-modal Retrieval

Jan-18-2025, 18:24:04 GMT–Neural Information Processing Systems

Vision and diverse languages are important information sources in our living world. A model that understands multiple modalities and multiple languages can be applied to a wider range of real-life scenarios. To build such a multimodal and multilingual model, existing works combine vision-language data from multiple languages during pre-training. However, because of the large number of languages, these approaches often require huge computing resources and cannot be flexibly extended to new languages. In this work, we propose a MultiLingual Acquisition (MLA) framework that can easily empower a monolingual Vision-Language Pre-training (VLP) model with multilingual capability. Specifically, we design a lightweight language acquisition encoder based on state-of-the-art monolingual VLP models.

acquisition, multi-lingual acquisition, multimodal pre-training

Technology:
- Information Technology > Artificial Intelligence > Machine Learning (1.00)