GlotLID: Language Identification for Low-Resource Languages

Kargaran, Amir Hossein, Imani, Ayyoob, Yvon, François, Schütze, Hinrich

Nov-4-2023, arXiv.org Artificial Intelligence

Several recent papers have published good solutions for language identification (LID) for about 300 high-resource and medium-resource languages. However, there is no LID available that (i) covers a wide range of low-resource languages, (ii) is rigorously evaluated and reliable and (iii) is efficient and easy to use. Here, we publish GlotLID-M, an LID model that satisfies the desiderata of wide coverage, reliability and efficiency. It identifies 1665 languages, a large increase in coverage compared to prior work. In our experiments, GlotLID-M outperforms four baselines (CLD3, FT176, OpenLID and NLLB) when balancing F1 and false positive rate (FPR). We analyze the unique challenges that low-resource LID poses: incorrect corpus metadata, leakage from high-resource languages, difficulty separating closely related languages, handling of macrolanguages versus their varieties and generally noisy data. We hope that integrating GlotLID-M into dataset creation pipelines will improve quality and enhance accessibility of NLP technology for low-resource languages and cultures. The GlotLID-M model, code, and list of data sources are available at https://github.com/cisnlp/GlotLID.
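To make the F1 versus FPR trade-off concrete, here is a minimal sketch of how per-language F1 and false positive rate could be computed from gold and predicted labels. The function name `per_language_f1_fpr` and the toy labels are illustrative assumptions, not the authors' evaluation code, and the paper's exact aggregation across languages may differ.

```python
from collections import Counter


def per_language_f1_fpr(gold, pred):
    """Return {label: (f1, fpr)} for every label seen in gold or pred.

    Hypothetical helper for illustration; not taken from the GlotLID code.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    n = len(gold)
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # p was predicted, but the sentence is not p
            fn[g] += 1  # g was the true label, but was missed
    scores = {}
    for label in set(gold) | set(pred):
        t, f_pos, f_neg = tp[label], fp[label], fn[label]
        tn = n - t - f_pos - f_neg
        f1_denom = 2 * t + f_pos + f_neg
        f1 = 2 * t / f1_denom if f1_denom else 0.0
        fpr = f_pos / (f_pos + tn) if (f_pos + tn) else 0.0
        scores[label] = (f1, fpr)
    return scores


# Made-up toy data: two low-resource labels (quz, grn) partly absorbed
# by a high-resource label (spa).
gold = ["quz", "quz", "spa", "spa", "spa", "grn"]
pred = ["quz", "spa", "spa", "spa", "spa", "spa"]
scores = per_language_f1_fpr(gold, pred)
for label in sorted(scores):
    f1, fpr = scores[label]
    print(f"{label}: F1={f1:.2f} FPR={fpr:.2f}")
```

In this toy data the high-resource label reaches a reasonable F1 while having a high FPR, because it absorbs sentences from the two low-resource labels. This is the kind of leakage that F1 alone can hide.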

language identification, natural language processing, resource and evaluation conference

Country:
- South America
  - Paraguay (0.04)
  - Peru
    - Huánuco Department > Huánuco Province
      - Huánuco (0.04)
    - Cusco Department > Cusco Province
      - Cusco (0.04)
    - Arequipa Department > Arequipa Province
      - Arequipa (0.04)
  - Chile > Santiago Metropolitan Region
    - Santiago Province > Santiago (0.04)
  - Argentina > Gran Chaco
    - Santiago del Estero Province > Santiago del Estero (0.04)
- Oceania
  - Tonga (0.04)
  - Tuvalu (0.04)
  - Tokelau (0.04)
  - Papua New Guinea > Morobe Province (0.04)
  - Nauru (0.04)
  - Fiji (0.04)
- North America
  - Belize (0.04)
  - United States
    - Colorado (0.04)
    - Pennsylvania (0.04)
    - Maryland > Baltimore (0.04)
    - Alaska (0.04)
    - Oregon > Multnomah County
      - Portland (0.04)
    - Louisiana > Orleans Parish
      - New Orleans (0.04)
  - Mexico
    - Puebla (0.04)
    - Oaxaca (0.04)
    - Querétaro (0.04)
    - Michoacán (0.04)
    - Estado de México (0.04)
  - El Salvador > San Salvador
    - San Salvador (0.04)
  - Dominican Republic > Distrito Nacional
    - Santo Domingo (0.04)
  - Canada
    - Ontario > Toronto (0.04)
    - Quebec > Montreal (0.04)
    - British Columbia > Metro Vancouver Regional District
      - Vancouver (0.04)
- Europe
  - Russia (0.04)
  - Slovenia (0.04)
  - Belgium (0.04)
  - Ukraine (0.04)
  - Iceland > Capital Region
    - Reykjavik (0.04)
  - Italy > Tuscany
    - Florence (0.04)
  - Finland > Northern Ostrobothnia
    - Oulu (0.04)
  - Germany
    - Saxony > Leipzig (0.04)
    - Berlin (0.04)
    - Bavaria > Upper Bavaria
      - Munich (0.04)
  - France > Provence-Alpes-Côte d'Azur
    - Bouches-du-Rhône > Marseille (0.04)
  - Spain
    - Valencian Community > Valencia Province
      - Valencia (0.04)
    - Catalonia > Barcelona Province
      - Barcelona (0.04)
  - Denmark > Capital Region
    - Copenhagen (0.04)
  - Bulgaria > Sofia City Province
    - Sofia (0.04)
  - Portugal > Lisbon
    - Lisbon (0.04)
  - Sweden > Vaestra Goetaland
    - Gothenburg (0.04)
  - Middle East > Republic of Türkiye
    - Istanbul Province > Istanbul (0.04)
  - Croatia > Dubrovnik-Neretva County
    - Dubrovnik (0.04)
  - Faroe Islands > Streymoy
    - Tórshavn (0.04)
  - Ireland > Leinster
    - County Dublin > Dublin (0.04)
  - Estonia > Tartu County
    - Tartu (0.04)
- Asia
  - India (0.04)
  - Russia (0.04)
  - South Korea (0.04)
  - China > Shanghai
    - Shanghai (0.04)
  - Myanmar > Chin State
    - Hakha (0.04)
  - Philippines
    - Mindanao
      - Soccsksargen > Province of Sarangani (0.04)
      - Bangsamoro > Province of Maguindanao del Norte
        - City of Cotabato (0.04)
    - Luzon > Ilocos Region
      - Province of Pangasinan (0.04)
  - Middle East
    - Israel (0.04)
    - UAE > Abu Dhabi Emirate
      - Abu Dhabi (0.04)
    - Republic of Türkiye > Istanbul Province
      - Istanbul (0.04)
    - Qatar > Ad-Dawhah
      - Doha (0.04)
    - Iran > Tehran Province
      - Tehran (0.04)
  - Indonesia
    - East Nusa Tenggara > Kupang (0.04)
    - Sulawesi > Gorontalo
      - Gorontalo (0.04)
  - Japan > Kyūshū & Okinawa
    - Kyūshū > Miyazaki Prefecture > Miyazaki (0.04)
  - Thailand > Pattani
    - Pattani (0.04)
- Africa
  - Cameroon (0.04)
  - Democratic Republic of the Congo (0.04)
  - Nigeria (0.04)
  - Zambia (0.04)
  - Benin (0.04)
  - Liberia (0.04)
  - Kenya (0.04)
  - Niger (0.04)
  - Sierra Leone (0.04)
  - Malawi (0.04)
  - Equatorial Guinea (0.04)
  - South Sudan (0.04)
  - Tanzania (0.04)
  - Uganda (0.04)
  - Ghana (0.04)
  - Central African Republic (0.04)
  - Ethiopia (0.04)
  - Togo (0.04)
  - Côte d'Ivoire > Goh-Djiboua
    - Gagnoa (0.04)

Genre:
- Research Report > New Finding (0.87)

Industry:
- Media > Television (0.45)
- Health & Medicine > Therapeutic Area
  - Neurology (0.33)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Machine Translation (1.00)
  - Machine Learning > Performance Analysis
    - Accuracy (1.00)