Learning Contextualised Cross-lingual Word Embeddings for Extremely Low-Resource Languages Using Parallel Corpora
Wada, Takashi; Iwata, Tomoharu; Matsumoto, Yuji; Baldwin, Timothy; Lau, Jey Han
arXiv.org Artificial Intelligence
We propose a new approach for learning contextualised cross-lingual word embeddings based only on a small parallel corpus (e.g. a few hundred sentence pairs). Our method obtains word embeddings via an LSTM-based encoder-decoder model that performs bidirectional translation and reconstruction of the input sentence. Through sharing model parameters among different languages, our model jointly trains the word embeddings in a common multilingual space. We also propose a simple method to combine word and subword embeddings to make use of orthographic similarities across different languages. We base our experiments on real-world data from endangered languages, namely Yongning Na, Shipibo-Konibo and Griko. Our experiments on bilingual lexicon induction and word alignment tasks show that our model outperforms existing methods by a large margin for most language pairs. These results demonstrate that, contrary to common belief, an encoder-decoder translation model is beneficial for learning cross-lingual representations, even in extremely low-resource scenarios.
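As an illustration of the bilingual lexicon induction task mentioned above, the following is a minimal sketch that assumes word embeddings for two languages already live in a common space and retrieves, for each source word, the target word with the highest cosine similarity. The abstract does not state which retrieval criterion the authors use, so cosine nearest-neighbour search is an assumption here, and the function names, words and vectors are invented toy values rather than data from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def induce_lexicon(src_emb, tgt_emb):
    """Map each source word to its nearest target word in the shared space.

    Hypothetical helper: plain nearest-neighbour retrieval, assumed for
    illustration and not taken from the paper.
    """
    lexicon = {}
    for src_word, src_vec in src_emb.items():
        lexicon[src_word] = max(tgt_emb, key=lambda w: cosine(src_vec, tgt_emb[w]))
    return lexicon


# Toy embeddings (invented values) standing in for a trained shared space.
src_emb = {"agua": [0.9, 0.1, 0.0], "casa": [0.0, 0.2, 0.9]}
tgt_emb = {
    "water": [1.0, 0.0, 0.1],
    "house": [0.1, 0.1, 1.0],
    "dog": [0.0, 1.0, 0.0],
}

lexicon = induce_lexicon(src_emb, tgt_emb)
print(lexicon)
```

In practice the vectors would come from the trained encoder-decoder model rather than being written by hand, and the induced pairs would be scored against a gold dictionary.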
Oct-27-2020
- Country:
  - South America
    - Peru (0.04)
    - Colombia > Meta Department
      - Villavicencio (0.04)
  - Oceania > Australia
  - North America
    - United States
      - Oregon (0.04)
      - Maryland (0.04)
      - Hawaii (0.04)
      - Texas > Travis County
        - Austin (0.04)
      - Minnesota > Hennepin County
        - Minneapolis (0.14)
      - Colorado > Denver County
        - Denver (0.04)
      - Ohio > Franklin County
        - Columbus (0.04)
      - Arizona > Maricopa County
        - Scottsdale (0.04)
      - Louisiana > Orleans Parish
        - New Orleans (0.04)
      - New Mexico > Santa Fe County
        - Santa Fe (0.04)
      - Georgia > Fulton County
        - Atlanta (0.04)
    - Canada > British Columbia
  - Europe
    - Spain > Valencian Community
      - Valencia Province > Valencia (0.04)
    - Russia > Northwestern Federal District
      - Leningrad Oblast > Saint Petersburg (0.04)
    - Portugal > Lisbon
      - Lisbon (0.04)
    - Italy > Tuscany
      - Florence (0.04)
    - France > Provence-Alpes-Côte d'Azur
      - Bouches-du-Rhône > Marseille (0.04)
    - Bulgaria > Varna Province
      - Varna (0.04)
    - Belgium > Brussels-Capital Region
      - Brussels (0.04)
  - Asia
  - Africa
    - Middle East > Egypt
      - Giza Governorate > Giza (0.05)
    - Ethiopia > Addis Ababa
      - Addis Ababa (0.04)
- Genre:
  - Research Report > New Finding (0.48)
- Technology: