Targeted Multilingual Adaptation for Low-resource Language Families
Downey, C. M.; Blevins, Terra; Serai, Dhwani; Parikh, Dwija; Steinert-Threlkeld, Shane
arXiv.org Artificial Intelligence
The "massively-multilingual" training of multilingual models is known to limit their utility in any one language, and they perform particularly poorly on low-resource languages. However, there is evidence that low-resource languages can benefit from targeted multilinguality, where the model is trained on closely related languages. To test this approach more rigorously, we systematically study best practices for adapting a pre-trained model to a language family. Focusing on the Uralic family as a test case, we adapt XLM-R under various configurations to model 15 languages; we then evaluate the performance of each experimental setting on two downstream tasks and 11 evaluation languages. Our adapted models significantly outperform mono- and multilingual baselines. Furthermore, a regression analysis of hyperparameter effects reveals that adapted vocabulary size is relatively unimportant for low-resource languages, and that low-resource languages can be aggressively up-sampled during training at little detriment to performance in high-resource languages. These results introduce new best practices for performing language adaptation in a targeted setting.
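The abstract does not say how languages are up-sampled, so the following is only a minimal sketch of one common scheme, exponentiated (temperature-style) sampling, in which a language with n_i training tokens is drawn with probability proportional to n_i ** alpha. The function name `sampling_probs`, the language labels, and the token counts are all hypothetical and are not taken from the paper.

```python
def sampling_probs(token_counts, alpha):
    """Return per-language sampling probabilities proportional to count ** alpha.

    alpha = 1.0 samples languages in proportion to their data size;
    alpha = 0.0 samples all languages uniformly (the most aggressive up-sampling
    of low-resource languages under this scheme).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha is expected to lie in [0, 1]")
    # Exponentiate each count, then normalize so the weights sum to 1.
    weights = {lang: count ** alpha for lang, count in token_counts.items()}
    total = sum(weights.values())
    return {lang: weight / total for lang, weight in weights.items()}


# Hypothetical token counts for three languages of very different sizes.
counts = {"high": 1_000_000, "mid": 10_000, "low": 100}

for alpha in (1.0, 0.5, 0.0):
    probs = sampling_probs(counts, alpha)
    print(alpha, {lang: round(p, 4) for lang, p in probs.items()})
```

Lowering alpha moves probability mass from the largest language toward the smallest one, which is the trade-off the regression analysis in the abstract speaks to: how far low-resource languages can be up-sampled before high-resource performance suffers.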
May-20-2024
- Country:
  - Asia
    - Middle East
      - Israel (0.04)
      - UAE > Abu Dhabi Emirate
        - Abu Dhabi (0.04)
    - Russia (0.14)
    - Singapore (0.04)
  - Europe
    - Estonia > Tartu County
      - Tartu (0.04)
    - Belgium > Brussels-Capital Region
      - Brussels (0.04)
    - Faroe Islands > Streymoy
      - Tórshavn (0.04)
    - Croatia > Dubrovnik-Neretva County
      - Dubrovnik (0.04)
    - Russia > Northwestern Federal District
      - Komi Republic > Syktyvkar (0.04)
    - Slovenia > Drava
      - Municipality of Benedikt > Benedikt (0.04)
    - Portugal > Lisbon
      - Lisbon (0.04)
    - United Kingdom > England
      - Greater London > London (0.04)
    - France > Provence-Alpes-Côte d'Azur
      - Bouches-du-Rhône > Marseille (0.04)
    - Middle East > Malta
      - Eastern Region > Northern Harbour District > St. Julian's (0.04)
  - North America
    - Canada > British Columbia
    - Dominican Republic (0.04)
    - United States > Washington
      - King County > Seattle (0.04)
- Genre:
  - Research Report
    - Experimental Study (0.88)
    - New Finding (1.00)
- Technology: