Gitau, Catherine
State of NLP in Kenya: A Survey
Amol, Cynthia Jayne, Chimoto, Everlyn Asiko, Gesicho, Rose Delilah, Gitau, Antony M., Etori, Naome A., Kinyanjui, Caringtone, Ndung'u, Steven, Moruye, Lawrence, Ooko, Samson Otieno, Kitonga, Kavengi, Muhia, Brian, Gitau, Catherine, Ndolo, Antony, Wanzare, Lilian D. A., Kahira, Albert Njoroge, Tombe, Ronald
Kenya, known for its linguistic diversity, faces unique challenges and promising opportunities in advancing Natural Language Processing (NLP) technologies, particularly for its underrepresented indigenous languages. This survey provides a detailed assessment of the current state of NLP in Kenya, emphasizing ongoing efforts in dataset creation, machine translation, sentiment analysis, and speech recognition for local languages such as Kiswahili, Dholuo, Kikuyu, and Luhya. Despite these advancements, the development of NLP in Kenya remains constrained by limited resources and tools, resulting in the underrepresentation of most indigenous languages in digital spaces. By critically evaluating the available datasets and existing NLP models, this paper uncovers significant gaps, most notably the need for large-scale language models and the insufficient digital representation of indigenous languages. We also analyze key NLP applications, namely machine translation, information retrieval, and sentiment analysis, examining how they are tailored to address local linguistic needs. Furthermore, the paper explores the governance, policies, and regulations shaping the future of AI and NLP in Kenya and proposes a strategic roadmap to guide future research and development efforts. Our goal is to provide a foundation for accelerating the growth of NLP technologies that meet Kenya's diverse linguistic demands.
Textual Augmentation Techniques Applied to Low Resource Machine Translation: Case of Swahili
Gitau, Catherine, Marivate, Vukosi
In this work, we investigate the impact of applying textual data augmentation to low-resource machine translation. There has been recent interest in approaches for training systems for languages with limited resources, and one popular approach is data augmentation, which aims to increase the quantity of data available to train the system. In machine translation, the majority of the world's language pairs are considered low resource because little parallel data is available for them, and the quality of neural machine translation (NMT) systems depends heavily on the availability of sizable parallel corpora. We study and apply three simple data augmentation techniques popularly used in text classification tasks: synonym replacement, random insertion, and contextual data augmentation, and compare their performance with a baseline neural machine translation system on English-Swahili (En-Sw) datasets. We report results in BLEU, ChrF, and Meteor scores. Overall, the contextual data augmentation technique shows improvements in both the $EN \rightarrow SW$ and $SW \rightarrow EN$ directions. We see potential to use these methods in neural machine translation once more extensive experiments are done with diverse datasets.
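To make the first two augmentation techniques concrete, here is a minimal sketch of synonym replacement and random insertion. The tiny hard-coded synonym table is an illustrative assumption standing in for a real thesaurus (the paper itself works on En-Sw parallel data, not this toy vocabulary):

```python
import random

# Toy synonym table used only for illustration; a real system would draw
# synonyms from a thesaurus or lexical database for the source language.
SYNONYMS = {
    "quick": ["fast", "rapid"],
    "walks": ["strolls"],
    "happy": ["glad", "joyful"],
}

def synonym_replacement(tokens, n, rng):
    """Replace up to n tokens that have synonyms with a random synonym."""
    out = list(tokens)
    candidates = [i for i, t in enumerate(out) if t in SYNONYMS]
    rng.shuffle(candidates)
    for i in candidates[:n]:
        out[i] = rng.choice(SYNONYMS[out[i]])
    return out

def random_insertion(tokens, n, rng):
    """Insert n synonyms of randomly chosen tokens at random positions."""
    out = list(tokens)
    candidates = [t for t in tokens if t in SYNONYMS]
    for _ in range(n):
        if not candidates:
            break
        syn = rng.choice(SYNONYMS[rng.choice(candidates)])
        out.insert(rng.randrange(len(out) + 1), syn)
    return out

rng = random.Random(0)
sentence = "the quick dog walks".split()
print(synonym_replacement(sentence, 1, rng))
print(random_insertion(sentence, 1, rng))
```

Each augmented sentence is paired with the original target sentence, growing the parallel corpus at the cost of some noise. Contextual data augmentation differs in that replacement words are proposed by a language model conditioned on the surrounding context rather than drawn from a fixed synonym list.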
MasakhaPOS: Part-of-Speech Tagging for Typologically Diverse African Languages
Dione, Cheikh M. Bamba, Adelani, David, Nabende, Peter, Alabi, Jesujoba, Sindane, Thapelo, Buzaaba, Happy, Muhammad, Shamsuddeen Hassan, Emezue, Chris Chinenye, Ogayo, Perez, Aremu, Anuoluwapo, Gitau, Catherine, Mbaye, Derguene, Mukiibi, Jonathan, Sibanda, Blessing, Dossou, Bonaventure F. P., Bukula, Andiswa, Mabuya, Rooweither, Tapo, Allahsera Auguste, Munkoh-Buabeng, Edwin, Koagne, Victoire Memdjokam, Kabore, Fatoumata Ouoba, Taylor, Amelia, Kalipe, Godson, Macucwa, Tebogo, Marivate, Vukosi, Gwadabe, Tajuddeen, Elvis, Mboning Tchiaze, Onyenwe, Ikechukwu, Atindogbe, Gratien, Adelani, Tolulope, Akinade, Idris, Samuel, Olanrewaju, Nahimana, Marien, Musabeyezu, Théogène, Niyomutabazi, Emile, Chimhenga, Ester, Gotosa, Kudzai, Mizha, Patrick, Agbolo, Apelete, Traore, Seydou, Uchechukwu, Chinedu, Yusuf, Aliyu, Abdullahi, Muhammad, Klakow, Dietrich
In this paper, we present MasakhaPOS, the largest part-of-speech (POS) dataset for 20 typologically diverse African languages. We discuss the challenges in annotating POS for these languages using the UD (universal dependencies) guidelines. We conducted extensive POS baseline experiments using conditional random field and several multilingual pre-trained language models. We applied various cross-lingual transfer models trained with data available in UD. Evaluating on the MasakhaPOS dataset, we show that choosing the best transfer language(s) in both single-source and multi-source setups greatly improves the POS tagging performance of the target languages, in particular when combined with cross-lingual parameter-efficient fine-tuning methods. Crucially, transferring knowledge from a language that matches the language family and morphosyntactic properties seems more effective for POS tagging in unseen languages.
MasakhaNER 2.0: Africa-centric Transfer Learning for Named Entity Recognition
Adelani, David Ifeoluwa, Neubig, Graham, Ruder, Sebastian, Rijhwani, Shruti, Beukman, Michael, Palen-Michel, Chester, Lignos, Constantine, Alabi, Jesujoba O., Muhammad, Shamsuddeen H., Nabende, Peter, Dione, Cheikh M. Bamba, Bukula, Andiswa, Mabuya, Rooweither, Dossou, Bonaventure F. P., Sibanda, Blessing, Buzaaba, Happy, Mukiibi, Jonathan, Kalipe, Godson, Mbaye, Derguene, Taylor, Amelia, Kabore, Fatoumata, Emezue, Chris Chinenye, Aremu, Anuoluwapo, Ogayo, Perez, Gitau, Catherine, Munkoh-Buabeng, Edwin, Koagne, Victoire M., Tapo, Allahsera Auguste, Macucwa, Tebogo, Marivate, Vukosi, Mboning, Elvis, Gwadabe, Tajuddeen, Adewumi, Tosin, Ahia, Orevaoghene, Nakatumba-Nabende, Joyce, Mokono, Neo L., Ezeani, Ignatius, Chukwuneke, Chiamaka, Adeyemi, Mofetoluwa, Hacheme, Gilles Q., Abdulmumin, Idris, Ogundepo, Odunayo, Yousuf, Oreen, Ngoli, Tatiana Moteu, Klakow, Dietrich
African languages are spoken by over a billion people, but are underrepresented in NLP research and development. The challenges impeding progress include the limited availability of annotated datasets, as well as a lack of understanding of the settings where current methods are effective. In this paper, we make progress towards solutions for these challenges, focusing on the task of named entity recognition (NER). We create the largest human-annotated NER dataset for 20 African languages, and we study the behavior of state-of-the-art cross-lingual transfer methods in an Africa-centric setting, demonstrating that the choice of source language significantly affects performance. We show that choosing the best transfer language improves zero-shot F1 scores by an average of 14 points across 20 languages compared to using English. Our results highlight the need for benchmark datasets and models that cover typologically diverse African languages.
MasakhaNER: Named Entity Recognition for African Languages
Adelani, David Ifeoluwa, Abbott, Jade, Neubig, Graham, D'souza, Daniel, Kreutzer, Julia, Lignos, Constantine, Palen-Michel, Chester, Buzaaba, Happy, Rijhwani, Shruti, Ruder, Sebastian, Mayhew, Stephen, Azime, Israel Abebe, Muhammad, Shamsuddeen, Emezue, Chris Chinenye, Nakatumba-Nabende, Joyce, Ogayo, Perez, Aremu, Anuoluwapo, Gitau, Catherine, Mbaye, Derguene, Alabi, Jesujoba, Yimam, Seid Muhie, Gwadabe, Tajuddeen, Ezeani, Ignatius, Niyongabo, Rubungo Andre, Mukiibi, Jonathan, Otiende, Verrah, Orife, Iroro, David, Davis, Ngom, Samba, Adewumi, Tosin, Rayson, Paul, Adeyemi, Mofetoluwa, Muriuki, Gerald, Anebi, Emmanuel, Chukwuneke, Chiamaka, Odu, Nkiruka, Wairagala, Eric Peter, Oyerinde, Samuel, Siro, Clemencia, Bateesa, Tobius Saul, Oloyede, Temilola, Wambui, Yvonne, Akinode, Victor, Nabagereka, Deborah, Katusiime, Maurice, Awokoya, Ayodele, MBOUP, Mouhamadane, Gebreyohannes, Dibora, Tilaye, Henok, Nwaike, Kelechi, Wolde, Degaga, Faye, Abdoulaye, Sibanda, Blessing, Ahia, Orevaoghene, Dossou, Bonaventure F. P., Ogueji, Kelechi, DIOP, Thierno Ibrahima, Diallo, Abdoulaye, Akinfaderin, Adewale, Marengereke, Tendai, Osei, Salomey
We take a step towards addressing the under-representation of the African continent in NLP research by creating the first large publicly available high-quality dataset for named entity recognition (NER) in ten African languages, bringing together a variety of stakeholders. We detail characteristics of the languages to help researchers understand the challenges that these languages pose for NER. We analyze our datasets and conduct an extensive empirical evaluation of state-of-the-art methods across both supervised and transfer learning settings. We release the data, code, and models in order to inspire future research on African NLP.