LIMIT: Language Identification, Misidentification, and Translation using Hierarchical Models in 350+ Languages
Agarwal, Milind, Alam, Md Mahfuz Ibn, Anastasopoulos, Antonios
arXiv.org Artificial Intelligence
Knowing the language of an input text/audio is a necessary first step for using almost every NLP tool such as taggers, parsers, or translation systems. Language identification is a well-studied problem, sometimes even considered solved; in reality, due to lack of data and computational challenges, current systems cannot accurately identify most of the world's 7000 languages. To tackle this bottleneck, we first compile a corpus, MCS-350, of 50K multilingual and parallel children's stories in 350+ languages. MCS-350 can serve as a benchmark for language identification of short texts and for 1400+ new translation directions in low-resource Indian and African languages. Second, we propose a novel misprediction-resolution hierarchical model, LIMIt, for language identification that reduces error by 55% (from 0.71 to 0.32) on our compiled children's stories dataset and by 40% (from 0.23 to 0.14) on the FLORES-200 benchmark. Our method can expand language identification coverage into low-resource languages by relying solely on systemic misprediction patterns, bypassing the need to retrain large models from scratch.
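The two headline percentages in the abstract can be checked from the error values it quotes. The snippet below is a small arithmetic sketch, not code from the paper: the helper name `relative_reduction` is hypothetical, and it assumes the percentages are relative reductions, (before - after) / before, of the quoted error figures.

```python
def relative_reduction(before: float, after: float) -> float:
    """Return the fraction of the original error that was removed."""
    if before <= 0:
        raise ValueError("'before' must be positive")
    return (before - after) / before


# Error values as quoted in the abstract (hypothetical variable names).
mcs350 = relative_reduction(0.71, 0.32)   # children's stories dataset
flores = relative_reduction(0.23, 0.14)   # FLORES-200 benchmark

print(f"MCS-350 relative error reduction:    {mcs350:.1%}")
print(f"FLORES-200 relative error reduction: {flores:.1%}")
```

Under that assumption the results come out at roughly 55% and 39%, in line with the rounded 55% and 40% figures stated in the abstract.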
Nov-6-2023
- Country:
- Africa
- Niger (0.04)
- Sub-Saharan Africa (0.04)
- Asia
- Bangladesh (0.04)
- China > Hong Kong (0.04)
- India (0.04)
- Indonesia > Bali (0.04)
- Japan > Kyūshū & Okinawa
- Kyūshū > Miyazaki Prefecture > Miyazaki (0.04)
- Middle East
- South Korea (0.04)
- Europe
- Belgium > Brussels-Capital Region
- Brussels (0.04)
- Ireland > Leinster
- County Dublin > Dublin (0.04)
- Croatia > Dubrovnik-Neretva County
- Dubrovnik (0.04)
- Ukraine (0.04)
- Sweden > Västra Götaland
- Gothenburg (0.04)
- Portugal > Lisbon
- Lisbon (0.04)
- Romania > Sud-Muntenia Development Region
- Giurgiu County > Giurgiu (0.04)
- Denmark > Capital Region
- Copenhagen (0.04)
- Spain
- Catalonia > Barcelona Province
- Barcelona (0.04)
- Valencian Community > Valencia Province
- Valencia (0.04)
- France > Provence-Alpes-Côte d'Azur
- Bouches-du-Rhône > Marseille (0.04)
- Italy > Tuscany
- Florence (0.04)
- Middle East > Malta
- Port Region > Southern Harbour District > Valletta (0.04)
- Austria (0.04)
- North America
- Canada > Ontario
- Toronto (0.04)
- United States
- Georgia > Fulton County
- Atlanta (0.04)
- Louisiana > Orleans Parish
- New Orleans (0.04)
- Minnesota > Hennepin County
- Minneapolis (0.14)
- Pennsylvania (0.04)
- Oceania (0.04)
- South America (0.04)
- Genre:
- Research Report (1.00)
- Technology: