Geographically-Informed Language Identification
Dunn, Jonathan; Edwards-Brown, Lane
arXiv.org Artificial Intelligence
This paper develops an approach to language identification in which the set of languages considered by the model depends on the geographic origin of the text in question. Given that many digital corpora can be geo-referenced at the country level, this paper formulates 16 region-specific models, each of which contains the languages expected to appear in countries within that region. Each regional model also includes 31 widely-spoken international languages in order to ensure coverage of these linguae francae regardless of location. An upstream evaluation using traditional language identification testing data shows an improvement in F-score ranging from 1.7 points (Southeast Asia) to as much as 10.4 points (North Africa). A downstream evaluation on social media data shows that this improved performance has a significant impact on the language labels which are applied to large real-world corpora. The result is a highly accurate model that covers 916 languages at a sample size of 50 characters, with performance improved by incorporating geographic information into the model.
Mar-14-2024
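The idea in the abstract, that the candidate languages depend on where a text comes from, can be illustrated with a minimal sketch. This is not the authors' implementation: the paper describes separate region-specific models, whereas the sketch below simply restricts the scores of a single hypothetical scorer to a region's language set. The country-to-region mapping, the language inventories, and the function names are all illustrative assumptions, and the three-language international set is a stand-in for the 31 international languages mentioned in the abstract.

```python
from typing import Dict, Set

# Hypothetical, illustrative data. The real region and language inventories
# are not given in the abstract; these small sets are placeholders.
COUNTRY_TO_REGION: Dict[str, str] = {
    "DZ": "north_africa",
    "TH": "southeast_asia",
}
REGIONAL_LANGUAGES: Dict[str, Set[str]] = {
    "north_africa": {"ara", "kab"},
    "southeast_asia": {"tha", "khm"},
}
# Stand-in for the international languages included in every regional model.
INTERNATIONAL_LANGUAGES: Set[str] = {"eng", "fra", "spa"}


def candidate_languages(country_code: str) -> Set[str]:
    """Return the regional languages plus the international languages."""
    region = COUNTRY_TO_REGION[country_code]
    return REGIONAL_LANGUAGES[region] | INTERNATIONAL_LANGUAGES


def identify(scores: Dict[str, float], country_code: str) -> str:
    """Pick the best-scoring language among those allowed for the country.

    `scores` is assumed to come from some upstream scorer (hypothetical).
    """
    allowed = candidate_languages(country_code)
    restricted = {lang: s for lang, s in scores.items() if lang in allowed}
    if not restricted:
        raise ValueError("no candidate language has a score")
    return max(restricted, key=lambda lang: restricted[lang])


# The same scores yield different labels depending on the country of origin.
example_scores = {"tha": 0.4, "kab": 0.5, "eng": 0.1}
label_th = identify(example_scores, "TH")
label_dz = identify(example_scores, "DZ")
```

The usage lines at the end show the intended effect: a language that is implausible for a given country is never chosen, even when it scores highest without the restriction.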
- Country:
- Africa
- Middle East (0.04)
- Niger (0.04)
- North Africa (0.24)
- Sub-Saharan Africa (0.04)
- Asia
- Japan > Honshū
- Kansai > Osaka Prefecture > Osaka (0.04)
- Malaysia (0.04)
- Central Asia (0.04)
- East Asia (0.04)
- Middle East
- Iran (0.04)
- Israel (0.04)
- UAE > Abu Dhabi Emirate
- Abu Dhabi (0.04)
- Russia (0.05)
- Thailand > Chiang Mai
- Chiang Mai (0.04)
- Southeast Asia (0.24)
- India (0.04)
- Europe
- Eastern Europe (0.04)
- France > Provence-Alpes-Côte d'Azur
- Bouches-du-Rhône > Marseille (0.04)
- Middle East (0.04)
- Russia (0.05)
- Spain > Valencian Community
- Valencia Province > Valencia (0.04)
- Sweden > Vaestra Goetaland
- Gothenburg (0.04)
- Ukraine (0.04)
- Western Europe (0.05)
- North America
- Canada
- British Columbia > Metro Vancouver Regional District
- Vancouver (0.04)
- Ontario > Toronto (0.04)
- United States
- California > Los Angeles County
- Los Angeles (0.14)
- Illinois > Champaign County
- Urbana (0.04)
- New Mexico > Santa Fe County
- Santa Fe (0.04)
- Oceania (0.05)
- South America > Brazil (0.05)
- Genre:
- Research Report
- Experimental Study (0.47)
- New Finding (0.69)
- Technology: