The African Languages Lab: A Collaborative Approach to Advancing Low-Resource African NLP
Issaka, Sheriff; Wang, Keyi; Ajibola, Yinka; Samuel-Ipaye, Oluwatumininu; Zhang, Zhaoyi; Jimenez, Nicte Aguillon; Agyei, Evans Kofi; Lin, Abraham; Ramachandran, Rohan; Mumin, Sadick Abdul; Nchifor, Faith; Shuraim, Mohammed; Liu, Lieqi; Gonzalez, Erick Rosas; Kpei, Sylvester; Osei, Jemimah; Ajeneza, Carlene; Boateng, Persis; Yeboah, Prisca Adwoa Dufie; Gabriel, Saadia
–arXiv.org Artificial Intelligence
Despite representing nearly one-third of the world's languages, African languages remain critically underserved by modern NLP technologies, with 88% classified as severely underrepresented or completely ignored in computational linguistics. We present the African Languages Lab (All Lab), a comprehensive research initiative that addresses this technological gap through systematic data collection, model development, and capacity building. Our contributions include: (1) a quality-controlled data collection pipeline, yielding the largest validated African multi-modal speech and text dataset spanning 40 languages with 19 billion tokens of monolingual text and 12,628 hours of aligned speech data; (2) extensive experimental validation demonstrating that our dataset, combined with fine-tuning, achieves substantial improvements over baseline models, averaging +23.69 ChrF++, +0.33 COMET, and +15.34 BLEU points across 31 evaluated languages; and (3) a structured research program that has successfully mentored fifteen early-career researchers, establishing sustainable local capacity. Our comparative evaluation against Google Translate reveals competitive performance in several languages while identifying areas that require continued development.
Oct-8-2025
- Country:
- Africa
- Cameroon (0.04)
- Côte d'Ivoire (0.04)
- Ghana > Central Region
- Cape Coast (0.04)
- Kenya (0.04)
- Mali (0.04)
- Nigeria (0.04)
- Zambia > Southern Province
- Choma (0.04)
- Asia
- Indonesia > Bali (0.04)
- Japan > Honshū
- Chūbu > Toyama Prefecture > Toyama (0.04)
- Middle East
- Israel (0.04)
- Jordan (0.04)
- Qatar (0.04)
- Republic of Türkiye > Istanbul Province
- Istanbul (0.04)
- Saudi Arabia > Asir Province
- Abha (0.04)
- UAE > Abu Dhabi Emirate
- Abu Dhabi (0.04)
- Myanmar > Tanintharyi Region
- Dawei (0.04)
- Singapore (0.04)
- Thailand > Bangkok
- Bangkok (0.04)
- Europe
- France > Provence-Alpes-Côte d'Azur
- Bouches-du-Rhône > Marseille (0.04)
- Italy > Tuscany
- Florence (0.04)
- Middle East > Republic of Türkiye
- Istanbul Province > Istanbul (0.04)
- Monaco (0.04)
- Slovenia (0.04)
- North America
- Canada > Ontario
- National Capital Region > Ottawa (0.04)
- Toronto (0.04)
- Dominican Republic (0.04)
- Mexico > Mexico City
- Mexico City (0.04)
- United States
- California > Los Angeles County
- Los Angeles (0.14)
- Florida > Miami-Dade County
- Miami (0.04)
- Massachusetts > Middlesex County
- Cambridge (0.04)
- Michigan > Washtenaw County
- Ann Arbor (0.04)
- Wisconsin > Dane County
- Madison (0.04)
- Genre:
- Research Report (0.81)
- Industry:
- Education (0.67)
- Information Technology (0.67)
- Technology: