English Contrastive Learning Can Learn Universal Cross-lingual Sentence Embeddings
Wang, Yau-Shian, Wu, Ashley, Neubig, Graham
arXiv.org Artificial Intelligence
Universal cross-lingual sentence embeddings map semantically similar sentences from different languages into a shared embedding space. Aligning cross-lingual sentence embeddings usually requires supervised cross-lingual parallel sentences. In this work, we propose mSimCSE, which extends SimCSE to multilingual settings, and reveal that contrastive learning on English data can, surprisingly, learn high-quality universal cross-lingual sentence embeddings without any parallel data. In unsupervised and weakly supervised settings, mSimCSE significantly improves over previous sentence embedding methods on cross-lingual retrieval and multilingual STS tasks. The performance of unsupervised mSimCSE is comparable to that of fully supervised methods on low-resource language retrieval and multilingual STS. The performance can be further enhanced when cross-lingual NLI data is available. Our code is publicly available at https://github.com/yaushian/mSimCSE.
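To make the contrastive objective mentioned in the abstract concrete, the following is a minimal sketch of an InfoNCE-style loss of the kind used in SimCSE-like training, written in plain Python. It is not taken from the authors' repository: the function names `cosine` and `info_nce_loss`, the toy two-dimensional vectors, and the temperature value of 0.05 are assumptions chosen for illustration. In a real setup the vectors would come from a multilingual sentence encoder, and each positive would be a second encoding of the same English sentence (or an NLI-derived pair in the supervised setting).

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(anchors, positives, temperature=0.05):
    """Mean in-batch contrastive loss (hypothetical helper, for illustration).

    anchors[i] is treated as matching positives[i]; every other
    positives[j] in the batch serves as a negative for anchors[i].
    The temperature default is an assumption, not a value from the abstract.
    """
    n = len(anchors)
    total = 0.0
    for i in range(n):
        logits = [cosine(anchors[i], positives[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp over all candidates in the batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of picking the true positive.
        total += log_denom - logits[i]
    return total / n


# Toy embeddings: the loss is lower when each anchor is paired with its true positive.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce_loss(anchors, [[0.9, 0.1], [0.1, 0.9]])
shuffled = info_nce_loss(anchors, [[0.1, 0.9], [0.9, 0.1]])
print(aligned < shuffled)
```

Minimizing such a loss pulls each sentence toward its positive and pushes it away from the other sentences in the batch. The abstract's claim is that doing this on English data is enough for the resulting embedding space to align sentences across languages.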
Nov-11-2022
- Country:
  - North America
    - Cuba (0.04)
    - United States
      - Texas (0.04)
      - Pennsylvania > Allegheny County
        - Pittsburgh (0.04)
      - Minnesota > Hennepin County
        - Minneapolis (0.14)
  - Europe
    - Italy > Tuscany
      - Florence (0.04)
    - Ireland > Leinster
      - County Dublin > Dublin (0.04)
    - Denmark > Capital Region
      - Copenhagen (0.04)
  - Asia > China
    - Hong Kong (0.04)
- Genre:
  - Research Report > New Finding (0.68)
- Technology: