C2KD: Cross-Lingual Cross-Modal Knowledge Distillation for Multilingual Text-Video Retrieval
Rouditchenko, Andrew, Chuang, Yung-Sung, Shvetsova, Nina, Thomas, Samuel, Feris, Rogerio, Kingsbury, Brian, Karlinsky, Leonid, Harwath, David, Kuehne, Hilde, Glass, James
Multilingual text-video retrieval methods have improved significantly in recent years, but performance for languages other than English still lags behind. We propose a Cross-Lingual Cross-Modal Knowledge Distillation method to improve multilingual text-video retrieval. Inspired by the fact that English text-video retrieval outperforms retrieval in other languages, we train a student model using input text in different languages to match the cross-modal predictions of teacher models that use input text in English. We propose a cross-entropy-based objective which forces the distribution over the student's text-video similarity scores to be similar to that of the teacher models. We introduce a new multilingual video dataset, Multi-YouCook2, by translating the English captions in the YouCook2 video dataset into 8 other languages. Our method improves multilingual text-video retrieval performance on Multi-YouCook2 and several other datasets such as Multi-MSRVTT and VATEX. We also analyze the effectiveness of different multilingual text models as teachers. The code, models, and dataset are available at https://github.com/roudimit/c2kd.
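To illustrate the distillation objective the abstract describes, here is a minimal sketch of a cross-entropy loss that pushes the student's text-video similarity distribution toward the teacher's. The function name `c2kd_loss`, the temperature value, and the symmetric text-to-video / video-to-text averaging are illustrative assumptions, not the authors' exact implementation (see the released code at the link above for that).

```python
# Illustrative sketch only: matches a student's similarity distribution to a
# teacher's via cross entropy with soft targets. Shapes, temperature, and the
# bidirectional averaging are assumptions, not the paper's exact recipe.
import torch
import torch.nn.functional as F

def c2kd_loss(student_sim: torch.Tensor,
              teacher_sim: torch.Tensor,
              temperature: float = 0.07) -> torch.Tensor:
    """Cross-entropy distillation over text-video similarity scores.

    student_sim: [B, B] similarities between non-English texts and videos.
    teacher_sim: [B, B] similarities between English texts and the same videos.
    """
    # Teacher distributions serve as soft targets; block gradients through them.
    t_t2v = F.softmax(teacher_sim.detach() / temperature, dim=1)      # text -> video
    t_v2t = F.softmax(teacher_sim.detach().t() / temperature, dim=1)  # video -> text

    # Student log-distributions over the same similarity matrix.
    s_t2v = F.log_softmax(student_sim / temperature, dim=1)
    s_v2t = F.log_softmax(student_sim.t() / temperature, dim=1)

    # Cross entropy between teacher and student distributions, both directions.
    loss_t2v = -(t_t2v * s_t2v).sum(dim=1).mean()
    loss_v2t = -(t_v2t * s_v2t).sum(dim=1).mean()
    return 0.5 * (loss_t2v + loss_v2t)
```

In this sketch, each row of the similarity matrix is treated as a distribution over candidate videos (or texts), so the student learns to rank candidates the way the English-text teacher does, rather than merely copying individual scores.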
arXiv.org Artificial Intelligence
May 9, 2023
- Country:
  - North America > United States > Minnesota (0.28)
- Genre:
  - Research Report (0.82)
- Industry:
  - Education > Educational Technology (0.49)
- Technology:
  - Information Technology > Artificial Intelligence > Machine Learning (1.00)
  - Information Technology > Artificial Intelligence > Natural Language (1.00)
  - Information Technology > Artificial Intelligence > Vision (0.68)
  - Information Technology > Communications (0.68)