Natural language processing for African languages
arXiv.org Artificial Intelligence
Recent advances in word embeddings and language models use large-scale unlabelled data and self-supervised learning to boost NLP performance. Multilingual models, often trained on web-sourced data such as Wikipedia, face several challenges: few low-resource languages are included, their data is often noisy, and the lack of labelled datasets makes it hard to evaluate performance beyond high-resource languages such as English. In this dissertation, we focus on languages spoken in Sub-Saharan Africa, where all indigenous languages can be regarded as low-resourced in terms of both the labelled data available for NLP tasks and the unlabelled data found on the web. We analyse the noise in publicly available corpora and curate a high-quality corpus, demonstrating that the quality of the semantic representations learned by word embeddings depends not only on the amount of pre-training data but also on its quality. We demonstrate empirically the limitations of word embeddings and the opportunities that multilingual pre-trained language models (PLMs) offer, especially for languages unseen during pre-training and in low-resource scenarios. We further study how to adapt and specialize multilingual PLMs to unseen African languages using a small amount of monolingual text. To address the under-representation of African languages in NLP research, we developed large-scale, human-annotated labelled datasets for 21 African languages in two impactful NLP tasks: named entity recognition and machine translation. We conduct an extensive empirical evaluation using state-of-the-art methods across supervised, weakly-supervised, and transfer learning settings.
Jul-2-2025
- Country:
- Africa
- North Africa (0.13)
- Lesotho (0.04)
- Mali (0.04)
- Southern Africa (0.04)
- Sierra Leone (0.04)
- Sub-Saharan Africa (0.24)
- Ethiopia (0.04)
- Niger (0.05)
- Sudan (0.14)
- Kenya (0.04)
- South Africa
- Gauteng > Pretoria (0.04)
- Kalahari Desert (0.04)
- Namibia > Kalahari Desert (0.04)
- The Gambia (0.04)
- Nigeria
- Federal Capital Territory > Abuja (0.04)
- Ogun State > Abeokuta (0.04)
- Osun State > Ile-Ife (0.04)
- Oyo State > Ibadan (0.04)
- Central African Republic (0.04)
- Ghana (0.04)
- Eswatini (0.04)
- Middle East
- Zimbabwe (0.04)
- Eritrea (0.13)
- Seychelles (0.04)
- Republic of the Congo (0.04)
- Mauritania (0.04)
- Rwanda (0.04)
- Mozambique (0.04)
- Uganda (0.04)
- East Africa (0.04)
- Gabon (0.04)
- South Sudan (0.04)
- Equatorial Guinea (0.04)
- Central Africa (0.04)
- Madagascar (0.04)
- Malawi (0.04)
- Liberia (0.04)
- Angola (0.04)
- Senegal (0.04)
- Cameroon (0.04)
- Burundi (0.04)
- Guinea-Bissau (0.04)
- Democratic Republic of the Congo (0.14)
- Mauritius (0.04)
- Comoros (0.04)
- Benin (0.04)
- Burkina Faso > Est Region (0.04)
- Botswana > Kalahari Desert (0.04)
- Zambia (0.04)
- Côte d'Ivoire (0.04)
- Asia
- China
- Indonesia
- Japan
- Honshū > Kansai
- Osaka Prefecture > Osaka (0.04)
- Kyūshū & Okinawa > Kyūshū
- Miyazaki Prefecture > Miyazaki (0.04)
- Middle East
- Iran (0.04)
- Israel (0.04)
- Qatar > Ad-Dawhah
- Doha (0.04)
- Republic of Türkiye > Istanbul Province
- Istanbul (0.04)
- Saudi Arabia (0.04)
- Yemen (0.04)
- Myanmar > Tanintharyi Region
- Dawei (0.04)
- North Korea (0.04)
- Singapore (0.04)
- Southeast Asia (0.04)
- Europe
- Czechia > Prague (0.04)
- Belgium > Brussels-Capital Region
- Brussels (0.04)
- United Kingdom
- England
- Greater London > London (0.04)
- Oxfordshire > Oxford (0.04)
- Scotland > City of Edinburgh
- Edinburgh (0.04)
- Ireland > Leinster
- County Dublin > Dublin (0.04)
- Sweden
- Uppsala County > Uppsala (0.04)
- Östergötland County > Linköping (0.04)
- Spain > Catalonia
- Barcelona Province > Barcelona (0.04)
- Middle East
- Malta (0.04)
- Republic of Türkiye > Istanbul Province
- Istanbul (0.04)
- Italy
- France
- Provence-Alpes-Côte d'Azur > Bouches-du-Rhône
- Marseille (0.04)
- Île-de-France > Paris
- Paris (0.04)
- Slovenia (0.04)
- Portugal > Lisbon
- Lisbon (0.13)
- Bulgaria > Sofia City Province
- Sofia (0.04)
- Romania > Sud - Muntenia Development Region
- Giurgiu County > Giurgiu (0.04)
- Denmark > Capital Region
- Copenhagen (0.04)
- Germany
- Berlin (0.04)
- Saarland > Saarbrücken (0.04)
- Saxony > Leipzig (0.04)
- Iceland > Capital Region
- Reykjavik (0.04)
- Indian Ocean > Red Sea (0.04)
- North America
- Canada
- British Columbia > Metro Vancouver Regional District
- Vancouver (0.04)
- Quebec > Montreal (0.04)
- Cuba (0.04)
- Dominican Republic (0.04)
- United States
- Colorado > Boulder County
- Boulder (0.04)
- California
- San Diego County > San Diego (0.04)
- San Francisco County > San Francisco (0.13)
- Washington > King County
- Seattle (0.04)
- New Mexico > Santa Fe County
- Santa Fe (0.04)
- Massachusetts > Middlesex County
- Cambridge (0.04)
- Utah > Salt Lake County
- Salt Lake City (0.04)
- Pennsylvania > Philadelphia County
- Philadelphia (0.04)
- Louisiana > Orleans Parish
- New Orleans (0.04)
- New York > Suffolk County
- Stony Brook (0.04)
- Ohio > Franklin County
- Columbus (0.04)
- Minnesota > Hennepin County
- Minneapolis (0.14)
- Oceania > Australia
- South America > Brazil (0.04)
- Genre:
- Research Report > New Finding (1.00)
- Industry:
- Education (1.00)
- Energy (0.92)
- Government > Regional Government (1.00)
- Health & Medicine > Therapeutic Area
- Immunology (0.67)
- Infections and Infectious Diseases (0.67)
- Information Technology (1.00)
- Law (0.92)
- Leisure & Entertainment > Sports (0.67)
- Media > News (1.00)
- Technology:
- Information Technology > Artificial Intelligence
- Machine Learning
- Inductive Learning (1.00)
- Neural Networks > Deep Learning (1.00)
- Statistical Learning (1.00)
- Natural Language
- Chatbot (1.00)
- Large Language Model (1.00)
- Machine Translation (1.00)
- Text Processing (1.00)