Improving Cross-lingual Information Retrieval on Low-Resource Languages via Optimal Transport Distillation
Huang, Zhiqi, Yu, Puxuan, Allan, James
arXiv.org Artificial Intelligence
Benefiting from transformer-based pre-trained language models, neural ranking models have made significant progress. More recently, the advent of multilingual pre-trained language models has provided strong support for designing neural cross-lingual retrieval models. However, because pre-training data is unbalanced across languages, multilingual language models already show a performance gap between high- and low-resource languages in many downstream tasks. Cross-lingual retrieval models built on such pre-trained models can inherit this language bias, leading to suboptimal results for low-resource languages. Moreover, unlike the English-to-English retrieval task, where large-scale training collections for document ranking such as MS MARCO are available, the lack of cross-lingual retrieval data for low-resource languages makes training cross-lingual retrieval models more challenging. In this work, we propose OPTICAL: Optimal Transport distillation for low-resource Cross-lingual information retrieval. To transfer a model from high- to low-resource languages, OPTICAL formulates the cross-lingual token alignment task as an optimal transport problem to learn from a well-trained monolingual retrieval model. By separating the cross-lingual knowledge from the knowledge of query-document matching, OPTICAL only needs bitext data for distillation training, which is more feasible for low-resource languages. Experimental results show that, with minimal training data, OPTICAL significantly outperforms strong baselines on low-resource languages, including neural machine translation.
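The abstract does not spell out the optimal transport formulation, so the following is only a generic illustration of the idea of treating token alignment as a transport problem. It is a minimal sketch that assumes uniform mass on every token, a cosine-distance cost, and entropy-regularized Sinkhorn iterations. The function names `cosine_cost` and `sinkhorn`, the parameters `eps` and `n_iters`, and the toy embeddings in the usage note are hypothetical and not taken from the paper.

```python
import math


def cosine_cost(X, Y):
    """Return C with C[i][j] = 1 - cos(X[i], Y[j]) for two lists of vectors."""
    def norm(v):
        return math.sqrt(sum(a * a for a in v))

    def cos(u, v):
        return sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))

    return [[1.0 - cos(x, y) for y in Y] for x in X]


def sinkhorn(cost, eps=0.1, n_iters=200):
    """Entropy-regularized optimal transport between uniform marginals.

    Assumption: every source token and every target token carries equal mass.
    Returns a transport plan P where P[i][j] is the mass moved from source
    token i to target token j.
    """
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # source marginal (uniform)
    b = [1.0 / m] * m  # target marginal (uniform)
    # Gibbs kernel: low cost gives a high affinity.
    K = [[math.exp(-c / eps) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]
```

As a usage illustration with made-up two-dimensional "embeddings", `sinkhorn(cosine_cost([[1.0, 0.0], [0.0, 1.0]], [[0.9, 0.1], [0.1, 0.9], [0.7, 0.7]]))` returns a 2 x 3 plan in which each source token sends most of its mass to the target token it is closest to. A soft alignment of this kind is one way a student model's token representations could be matched against a teacher's, although the paper's actual objective may differ.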
Jan-29-2023
- Country:
- North America > United States
- Washington > King County
- Seattle (0.04)
- New York > New York County
- New York City (0.04)
- Massachusetts > Hampshire County
- Amherst (0.14)
- California > Los Angeles County
- Los Angeles (0.04)
- Europe
- Asia
- Middle East > Israel (0.04)
- Singapore > Central Region
- Singapore (0.05)
- Africa
- Niger (0.04)
- Middle East > Egypt
- Giza Governorate > Giza (0.04)
- Genre:
- Research Report
- New Finding (0.48)
- Experimental Study (0.46)
- Industry:
- Education (0.68)
- Government (0.46)
- Technology: