SERENGETI: Massively Multilingual Language Models for Africa
Adebara, Ife; Elmadany, AbdelRahim; Abdul-Mageed, Muhammad; Inciarte, Alcides Alcoba
arXiv.org Artificial Intelligence
Multilingual pretrained language models (mPLMs) acquire valuable, generalizable linguistic information during pretraining and have advanced the state of the art on task-specific finetuning. To date, only about 31 of roughly 2,000 African languages are covered by existing language models. We address this limitation by developing SERENGETI, a massively multilingual language model that covers 517 African languages and language varieties. We evaluate our models on eight natural language understanding tasks across 20 datasets, comparing against four mPLMs that cover 4 to 23 African languages. SERENGETI outperforms the other models on 11 datasets across the eight tasks, achieving an average F1 of 82.27. We also analyze our models' errors, which allows us to investigate the influence of language genealogy and linguistic similarity when the models are applied in zero-shot settings. We will publicly release our models for research: https://github.com/UBC-NLP/serengeti
May-26-2023