kiswahili
Building low-resource African language corpora: A case study of Kidawida, Kalenjin and Dholuo
Mbogho, Audrey, Awuor, Quin, Kipkebut, Andrew, Wanzare, Lilian, Oloo, Vivian
Natural Language Processing is a crucial frontier in artificial intelligence, with broad applications in many areas, including public health, agriculture, education, and commerce. However, due to the lack of substantial linguistic resources, many African languages remain underrepresented in this digital transformation. This paper presents a case study on the development of linguistic corpora for three under-resourced Kenyan languages, Kidaw'ida, Kalenjin, and Dholuo, with the aim of advancing natural language processing and linguistic research in African communities. Our project, which lasted one year, employed a selective crowd-sourcing methodology to collect text and speech data from native speakers of these languages. Data collection involved (1) recording conversations and translation of the resulting text into Kiswahili, thereby creating parallel corpora, and (2) reading and recording written texts to generate speech corpora. We made these resources freely accessible via open-research platforms, namely Zenodo for the parallel text corpora and Mozilla Common Voice for the speech datasets, thus facilitating ongoing contributions and access for developers to train models and develop Natural Language Processing applications. The project demonstrates how grassroots efforts in corpus building can support the inclusion of African languages in artificial intelligence innovations. In addition to filling resource gaps, these corpora are vital in promoting linguistic diversity and empowering local communities by enabling Natural Language Processing applications tailored to their needs. As African countries like Kenya increasingly embrace digital transformation, developing indigenous language resources becomes essential for inclusive growth. We encourage continued collaboration from native speakers and developers to expand and utilize these corpora.
Algorithm for Semantic Network Generation from Texts of Low Resource Languages Such as Kiswahili
Wanjawa, Barack Wamkaya, Muchemi, Lawrence, Miriti, Evans
Box 30197, Nairobi 00100, Kenya. eamiriti@uonbi.ac.ke. Abstract: Processing low-resource languages, such as Kiswahili, using machine learning is difficult due to a lack of adequate training data. However, such low-resource languages are still important for human communication and are already in daily use, and their users need practical machine processing tasks such as summarization, disambiguation and even question answering (QA). One method of processing such languages, while bypassing the need for training data, is the use of semantic networks. Some low-resource languages, such as Kiswahili, have a subject-verb-object (SVO) structure, and semantic networks are similarly built from subject-predicate-object triples, hence SVO part-of-speech tags can map into a semantic network triple. An algorithm to process raw natural language text and map it into a semantic network is therefore necessary and desirable for structuring low-resource language texts. This algorithm was tested on the Kiswahili QA task, achieving up to 78.6% exact match. Highlights: Languages, both low- and high-resource, are important for communication. Low-resource languages lack the vast data repositories necessary for machine learning. Part-of-speech tags can be used to derive meaning from the language. An algorithm can create semantic networks out of the language's parts of speech. The semantic network of the language can support practical tasks such as QA.
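The mapping described in the abstract above, from SVO part-of-speech tags to subject-predicate-object triples, can be illustrated with a minimal sketch. This is not the authors' algorithm: the two-tag set ("N", "V"), the "first noun, first verb, next noun" heuristic, and the function names are all assumptions made for illustration, and the input is taken to be already tagged.

```python
from typing import List, Optional, Tuple

Token = Tuple[str, str]        # (word, tag); tags "N"/"V" are a hypothetical tag set
Triple = Tuple[str, str, str]  # (subject, predicate, object)


def sentence_to_triple(tagged: List[Token]) -> Optional[Triple]:
    """Map one POS-tagged SVO sentence to a triple, or None if a slot is missing."""
    subject = predicate = obj = None
    for word, tag in tagged:
        if tag == "N" and subject is None and predicate is None:
            subject = word      # first noun before any verb
        elif tag == "V" and subject is not None and predicate is None:
            predicate = word    # first verb after the subject
        elif tag == "N" and predicate is not None and obj is None:
            obj = word          # first noun after the verb
    if subject and predicate and obj:
        return (subject, predicate, obj)
    return None


def build_network(sentences: List[List[Token]]) -> List[Triple]:
    """Collect the triples of all sentences that yield a complete SVO pattern."""
    triples = (sentence_to_triple(s) for s in sentences)
    return [t for t in triples if t is not None]


def answer(network: List[Triple], subject: str, predicate: str) -> Optional[str]:
    """Answer a 'what does <subject> <predicate>?' question by triple lookup."""
    for s, p, o in network:
        if s == subject and p == predicate:
            return o
    return None


# "Juma anasoma kitabu" (Juma is reading a book), tagged by hand for the example.
network = build_network([[("Juma", "N"), ("anasoma", "V"), ("kitabu", "N")]])
print(answer(network, "Juma", "anasoma"))
```

The lookup needs no training data, which is the property the abstract emphasizes; a realistic system would still need a tagger and handling for sentences that do not follow a plain SVO pattern.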
State of NLP in Kenya: A Survey
Amol, Cynthia Jayne, Chimoto, Everlyn Asiko, Gesicho, Rose Delilah, Gitau, Antony M., Etori, Naome A., Kinyanjui, Caringtone, Ndung'u, Steven, Moruye, Lawrence, Ooko, Samson Otieno, Kitonga, Kavengi, Muhia, Brian, Gitau, Catherine, Ndolo, Antony, Wanzare, Lilian D. A., Kahira, Albert Njoroge, Tombe, Ronald
Kenya, known for its linguistic diversity, faces unique challenges and promising opportunities in advancing Natural Language Processing (NLP) technologies, particularly for its underrepresented indigenous languages. This survey provides a detailed assessment of the current state of NLP in Kenya, emphasizing ongoing efforts in dataset creation, machine translation, sentiment analysis, and speech recognition for local dialects such as Kiswahili, Dholuo, Kikuyu, and Luhya. Despite these advancements, the development of NLP in Kenya remains constrained by limited resources and tools, resulting in the underrepresentation of most indigenous languages in digital spaces. This paper uncovers significant gaps by critically evaluating the available datasets and existing NLP models, most notably the need for large-scale language models and the insufficient digital representation of indigenous languages. We also analyze key NLP applications (machine translation, information retrieval, and sentiment analysis), examining how they are tailored to address local linguistic needs. Furthermore, the paper explores the governance, policies, and regulations shaping the future of AI and NLP in Kenya and proposes a strategic roadmap to guide future research and development efforts. Our goal is to provide a foundation for accelerating the growth of NLP technologies that meet Kenya's diverse linguistic demands.
Kencorpus: A Kenyan Language Corpus of Swahili, Dholuo and Luhya for Natural Language Processing Tasks
Wanjawa, Barack, Wanzare, Lilian, Indede, Florence, McOnyango, Owen, Ombui, Edward, Muchemi, Lawrence
Indigenous African languages are categorized as under-served in Natural Language Processing. They therefore experience poor digital inclusivity and information access. The processing challenge with such languages has been how to use machine learning and deep learning models without the requisite data. The Kencorpus project intends to bridge this gap by collecting and storing text and speech data that is good enough for data-driven solutions in applications such as machine translation, question answering and transcription in multilingual communities. The Kencorpus dataset is a text and speech corpus for three languages predominantly spoken in Kenya: Swahili, Dholuo and Luhya. Data collection was done by researchers from communities, schools, media, and publishers. The Kencorpus dataset has a collection of 5,594 items: 4,442 texts (5.6M words) and 1,152 speech files (177 hours). Based on this data, Part of Speech tagging sets for Dholuo and Luhya (50,000 and 93,000 words respectively) were developed. We developed 7,537 Question-Answer pairs for Swahili and created a text translation set of 13,400 sentences from Dholuo and Luhya into Swahili. The datasets are useful for downstream machine learning tasks such as model training and translation. We also developed two proof-of-concept systems, a Kiswahili speech-to-text system and a machine learning system for the Question Answering task, with results of 18.87% word error rate and 80% Exact Match (EM) respectively. These initial results give great promise for the usability of Kencorpus to the machine learning community. Kencorpus is one of the few public domain corpora for these three low-resource languages and forms a basis for learning and sharing experiences for similar works, especially for low-resource languages.
MphayaNER: Named Entity Recognition for Tshivenda
Mbuvha, Rendani, Adelani, David I., Mutavhatsindi, Tendani, Rakhuhu, Tshimangadzo, Mauda, Aluwani, Maumela, Tshifhiwa Joshua, Masindi, Andisani, Rananga, Seani, Marivate, Vukosi, Marwala, Tshilidzi
Named Entity Recognition (NER) plays a vital role in various Natural Language Processing tasks such as information retrieval, text classification, and question answering. However, NER can be challenging, especially in low-resource languages with limited annotated datasets and tools. This paper adds to the effort of addressing these challenges by introducing MphayaNER, the first Tshivenda NER corpus in the news domain. We establish NER baselines by fine-tuning state-of-the-art models on MphayaNER. The study also explores zero-shot transfer between Tshivenda and other related Bantu languages, with chiShona and Kiswahili showing the best results. Augmenting MphayaNER with chiShona data was also found to improve model performance significantly. Both MphayaNER and the baseline models are made publicly available.
AI Apologised to Me? – The Net & U
I had the opportunity to engage in a conversation with an artificial intelligence (AI) program recently. It was fascinating to see how the conversation progressed and how the AI responded to my input. Are you curious about how our conversation went? How can I help you today? AI: Yes, I am familiar with Kiswahili. It is a Bantu language that is spoken primarily in East Africa, particularly in Tanzania, Kenya, Uganda, and the Democratic Republic of Congo.
Phonemic Representation and Transcription for Speech to Text Applications for Under-resourced Indigenous African Languages: The Case of Kiswahili
Awino, Ebbie, Wanzare, Lilian, Muchemi, Lawrence, Wanjawa, Barack, Ombui, Edward, Indede, Florence, McOnyango, Owen, Okal, Benard
Building automatic speech recognition (ASR) systems is a challenging task, especially for under-resourced languages that need to construct corpora nearly from scratch and lack sufficient training data. It has emerged that several African indigenous languages, including Kiswahili, are technologically under-resourced. ASR systems are crucial, particularly for hearing-impaired persons, who can benefit from having transcripts in their native languages. However, the absence of transcribed speech datasets has complicated efforts to develop ASR models for these indigenous languages. This paper explores the transcription process and the development of a Kiswahili speech corpus, which includes both read-out texts and spontaneous speech data from native Kiswahili speakers. The study also discusses the vowels and consonants in Kiswahili and provides an updated Kiswahili phoneme dictionary for the ASR model, which was created using CMU Sphinx, an open-source speech recognition toolkit. The ASR model was trained using an extended phonetic set and yielded a WER and SER of 18.87% and 49.5%, respectively, an improvement over previous similar research for under-resourced languages.
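The two metrics reported above can be computed as in the following minimal sketch. It assumes that WER is the word-level edit distance divided by the number of reference words, and that SER denotes sentence error rate (the share of utterances with at least one error); the abstract does not expand SER, so that reading is an assumption. The function names and the sample sentences are illustrative and not taken from the paper.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance (substitutions, insertions, deletions) between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(refs: List[str], hyps: List[str]) -> float:
    """Corpus-level word error rate: total word edits / total reference words."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in zip(refs, hyps))
    words = sum(len(r.split()) for r in refs)
    return errors / words


def ser(refs: List[str], hyps: List[str]) -> float:
    """Sentence error rate: fraction of utterances that are not recognized exactly."""
    wrong = sum(1 for r, h in zip(refs, hyps) if r.split() != h.split())
    return wrong / len(refs)


# Made-up reference transcripts and recognizer outputs, for illustration only.
refs = ["habari ya asubuhi", "asante sana"]
hyps = ["habari za asubuhi", "asante sana"]
print(f"WER = {wer(refs, hyps):.2%}, SER = {ser(refs, hyps):.2%}")
```

Because a single wrong word marks the whole utterance as an error, SER is typically much higher than WER, which is consistent with the gap between the two figures reported above.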
When Is TTS Augmentation Through a Pivot Language Useful?
Robinson, Nathaniel, Ogayo, Perez, Gangu, Swetha, Mortensen, David R., Watanabe, Shinji
Developing Automatic Speech Recognition (ASR) for low-resource languages is a challenge due to the small amount of transcribed audio data. For many such languages, audio and text are available separately, but not audio with transcriptions. Using text, speech can be synthetically produced via text-to-speech (TTS) systems. However, many low-resource languages do not have quality TTS systems either. We propose an alternative: produce synthetic audio by running text from the target language through a trained TTS system for a higher-resource pivot language. We investigate when and how this technique is most effective in low-resource settings. In our experiments, using several thousand synthetic TTS text-speech pairs and duplicating authentic data to balance yields optimal results. Our findings suggest that searching over a set of candidate pivot languages can lead to marginal improvements and that, surprisingly, ASR performance can be harmed by increases in measured TTS quality. Application of these findings improves ASR by 64.5% and 45.0% character error reduction rate (CERR) respectively for two low-resource languages: Guaraní and Suba.
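For readers unfamiliar with the metric, a character error reduction rate can be read as the relative drop in character error rate (CER) from a baseline system to an improved one. The sketch below assumes that standard relative-reduction definition; the abstract does not state the exact formula, and the function name and the CER values in the example are hypothetical.

```python
def cerr(baseline_cer: float, improved_cer: float) -> float:
    """Relative reduction in character error rate, as a fraction of the baseline CER."""
    if baseline_cer <= 0:
        raise ValueError("baseline CER must be positive")
    return (baseline_cer - improved_cer) / baseline_cer


# Hypothetical values: a baseline CER of 0.40 reduced to 0.30 by augmentation.
print(f"CERR = {cerr(0.40, 0.30):.1%}")
```

Under this reading, a CERR of 64.5% means the augmented system makes roughly a third as many character errors as the baseline, not that its CER is 64.5%.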
Google releases TyDi QA, a data set that aims to capture the uniqueness of languages
Google hopes to spur the development of AI capable of understanding the ways in which languages express different meanings. To this end, company researchers today detailed a data set -- TyDi QA, a question-answering data set covering 11 languages -- inspired by typological diversity, or the notion that different languages express meaning in structurally unique ways. TyDi QA is something of a complement to the English-language Natural Questions corpus Google released last year, and it attempts to capture the idiosyncrasies and features of tongues like Japanese and Arabic. The researchers point out, for instance, that English changes words to indicate one object ("book") versus many ("books"), and that Arabic has a third form to indicate if there are two of something ("كتابان", kitaban) beyond just singular ("كتاب", kitab) or plural ("كتب", kutub). "Because we selected a set of languages that are typologically distant from each other for this corpus, we expect models performing well on this dataset to generalize across a large number of the languages in the world," wrote Google Research scientist Jonathan Clark in a blog post.