AITopics | kiswahili

kiswahili

Building low-resource African language corpora: A case study of Kidawida, Kalenjin and Dholuo

Mbogho, Audrey, Awuor, Quin, Kipkebut, Andrew, Wanzare, Lilian, Oloo, Vivian

arXiv.org Artificial Intelligence, Jan-19-2025

Natural Language Processing is a crucial frontier in artificial intelligence, with broad applications in many areas, including public health, agriculture, education, and commerce. However, due to the lack of substantial linguistic resources, many African languages remain underrepresented in this digital transformation. This paper presents a case study on the development of linguistic corpora for three under-resourced Kenyan languages, Kidaw'ida, Kalenjin, and Dholuo, with the aim of advancing natural language processing and linguistic research in African communities. Our project, which lasted one year, employed a selective crowd-sourcing methodology to collect text and speech data from native speakers of these languages. Data collection involved (1) recording conversations and translating the resulting text into Kiswahili, thereby creating parallel corpora, and (2) reading and recording written texts to generate speech corpora. We made these resources freely accessible via open-research platforms, namely Zenodo for the parallel text corpora and Mozilla Common Voice for the speech datasets, thus facilitating ongoing contributions and giving developers access to train models and develop Natural Language Processing applications. The project demonstrates how grassroots efforts in corpus building can support the inclusion of African languages in artificial intelligence innovations. In addition to filling resource gaps, these corpora are vital in promoting linguistic diversity and empowering local communities by enabling Natural Language Processing applications tailored to their needs. As African countries like Kenya increasingly embrace digital transformation, developing indigenous language resources becomes essential for inclusive growth. We encourage continued collaboration from native speakers and developers to expand and utilize these corpora.

large language model, machine learning, natural language, (22 more...)

2501.11003

Country:

Africa > South Sudan (0.14)
Africa > Uganda (0.05)
North America > United States (0.04)
(17 more...)

Genre: Research Report (0.50)

Industry:

Health & Medicine (0.67)
Media > News (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Algorithm for Semantic Network Generation from Texts of Low Resource Languages Such as Kiswahili

Wanjawa, Barack Wamkaya, Muchemi, Lawrence, Miriti, Evans

arXiv.org Artificial Intelligence, Jan-16-2025

Box 30197 Nairobi 00100, Kenya. eamiriti@uonbi.ac.ke

Abstract: Processing low-resource languages, such as Kiswahili, using machine learning is difficult due to the lack of adequate training data. However, such low-resource languages are still important for human communication and are already in daily use, and their users need practical machine processing tasks such as summarization, disambiguation and even question answering (QA). One method of processing such languages, while bypassing the need for training data, is the use of semantic networks. Some low-resource languages, such as Kiswahili, have a subject-verb-object (SVO) structure, and semantic networks are similarly built from subject-predicate-object triples, hence SVO part-of-speech tags can map into a semantic network triple. An algorithm to process raw natural language text and map it into a semantic network is therefore necessary and desirable for structuring low-resource language texts. This algorithm was tested on the Kiswahili QA task, achieving up to 78.6% exact match.

Highlights: Languages, both low- and high-resource, are important for communication. Low-resource languages lack the vast data repositories necessary for machine learning. Use of a language's part-of-speech tags can create meaning from the language. An algorithm can create semantic networks out of the language's parts of speech. The semantic network of the language can do practical tasks such as QA.
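The mapping from SVO part-of-speech tags to a subject-predicate-object triple can be illustrated with a minimal sketch. This is not the authors' algorithm: the tag names `N` and `V`, the function `svo_triple`, and the example sentence are all hypothetical, and the sketch assumes a simple sentence with exactly one verb that has already been POS-tagged.

```python
from typing import List, Optional, Tuple

# Hypothetical tag names; the abstract does not specify the tagset used.
NOUN, VERB = "N", "V"

def svo_triple(tagged: List[Tuple[str, str]]) -> Optional[Tuple[str, str, str]]:
    """Map a POS-tagged SVO sentence to a (subject, predicate, object) triple.

    Returns None when no verb, subject noun, or object noun is found.
    """
    # Locate the first verb; it becomes the predicate of the triple.
    verb_idx = next((i for i, (_, tag) in enumerate(tagged) if tag == VERB), None)
    if verb_idx is None:
        return None
    # Subject: nearest noun before the verb. Object: first noun after it.
    subject = next((w for w, t in reversed(tagged[:verb_idx]) if t == NOUN), None)
    obj = next((w for w, t in tagged[verb_idx + 1:] if t == NOUN), None)
    if subject is None or obj is None:
        return None
    return (subject, tagged[verb_idx][0], obj)

# "Juma anasoma kitabu" (Juma is reading a book), tagged by hand for illustration.
sentence = [("Juma", "N"), ("anasoma", "V"), ("kitabu", "N")]
print(svo_triple(sentence))
```

A collection of such triples forms the edges of a semantic network, which could then be queried for QA-style lookups; real text would need handling for modifiers, multiple clauses, and pronouns that this sketch ignores.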

algorithm, low-resource language, semantic network, (15 more...)

doi: 10.32591/coas.ojit.0702.01055w

2501.09326

Country:

Africa > Kenya > Nairobi City County > Nairobi (0.25)
North America > United States (0.14)
Oceania > Australia (0.04)
(4 more...)

Genre: Research Report > New Finding (0.46)

Industry: Leisure & Entertainment (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Semantic Networks (1.00)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

State of NLP in Kenya: A Survey

Amol, Cynthia Jayne, Chimoto, Everlyn Asiko, Gesicho, Rose Delilah, Gitau, Antony M., Etori, Naome A., Kinyanjui, Caringtone, Ndung'u, Steven, Moruye, Lawrence, Ooko, Samson Otieno, Kitonga, Kavengi, Muhia, Brian, Gitau, Catherine, Ndolo, Antony, Wanzare, Lilian D. A., Kahira, Albert Njoroge, Tombe, Ronald

arXiv.org Artificial Intelligence, Oct-13-2024

Kenya, known for its linguistic diversity, faces unique challenges and promising opportunities in advancing Natural Language Processing (NLP) technologies, particularly for its underrepresented indigenous languages. This survey provides a detailed assessment of the current state of NLP in Kenya, emphasizing ongoing efforts in dataset creation, machine translation, sentiment analysis, and speech recognition for local dialects such as Kiswahili, Dholuo, Kikuyu, and Luhya. Despite these advancements, the development of NLP in Kenya remains constrained by limited resources and tools, resulting in the underrepresentation of most indigenous languages in digital spaces. This paper uncovers significant gaps by critically evaluating the available datasets and existing NLP models, most notably the need for large-scale language models and the insufficient digital representation of indigenous languages. We also analyze key NLP applications (machine translation, information retrieval, and sentiment analysis), examining how they are tailored to address local linguistic needs. Furthermore, the paper explores the governance, policies, and regulations shaping the future of AI and NLP in Kenya and proposes a strategic roadmap to guide future research and development efforts. Our goal is to provide a foundation for accelerating the growth of NLP technologies that meet Kenya's diverse linguistic demands.

large language model, machine learning, natural language, (19 more...)

2410.09948

Country:

Europe > Finland > Uusimaa > Helsinki (0.05)
Africa > Middle East > Somalia (0.04)
Asia > China (0.04)
(26 more...)

Genre: Overview (1.00)

Industry:

Law (1.00)
Information Technology > Security & Privacy (1.00)
Health & Medicine (0.68)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
(2 more...)

Kencorpus: A Kenyan Language Corpus of Swahili, Dholuo and Luhya for Natural Language Processing Tasks

Wanjawa, Barack, Wanzare, Lilian, Indede, Florence, McOnyango, Owen, Ombui, Edward, Muchemi, Lawrence

arXiv.org Artificial Intelligence, Jul-8-2023

Indigenous African languages are categorized as under-served in Natural Language Processing. They therefore experience poor digital inclusivity and information access. The processing challenge with such languages has been how to use machine learning and deep learning models without the requisite data. The Kencorpus project intends to bridge this gap by collecting and storing text and speech data that is good enough for data-driven solutions in applications such as machine translation, question answering and transcription in multilingual communities. The Kencorpus dataset is a text and speech corpus for three languages predominantly spoken in Kenya: Swahili, Dholuo and Luhya. Data collection was done by researchers from communities, schools, media, and publishers. The Kencorpus dataset has a collection of 5,594 items: 4,442 texts (5.6M words) and 1,152 speech files (177 hours). Based on this data, Part of Speech tagging sets for Dholuo and Luhya (50,000 and 93,000 words respectively) were developed. We developed 7,537 Question-Answer pairs for Swahili and created a text translation set of 13,400 sentences from Dholuo and Luhya into Swahili. The datasets are useful for downstream machine learning tasks such as model training and translation. We also developed two proof-of-concept systems, one for Kiswahili speech-to-text and one machine learning system for the Question Answering task, with results of 18.87% word error rate and 80% Exact Match (EM) respectively. These initial results give great promise for the usability of Kencorpus to the machine learning community. Kencorpus is one of few public domain corpora for these three low-resource languages and forms a basis for learning and sharing experiences for similar works, especially for low-resource languages.

artificial intelligence, machine learning, natural language, (21 more...)

2208.12081

Country:

Africa > East Africa (0.14)
Africa > Kenya > Nairobi City County > Nairobi (0.04)
Europe > Finland > Uusimaa > Helsinki (0.04)
(20 more...)

Genre: Research Report (1.00)

Industry:

Education (1.00)
Media > News (0.93)
Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.46)
Health & Medicine > Therapeutic Area > Immunology (0.46)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
(2 more...)

MphayaNER: Named Entity Recognition for Tshivenda

Mbuvha, Rendani, Adelani, David I., Mutavhatsindi, Tendani, Rakhuhu, Tshimangadzo, Mauda, Aluwani, Maumela, Tshifhiwa Joshua, Masindi, Andisani, Rananga, Seani, Marivate, Vukosi, Marwala, Tshilidzi

arXiv.org Artificial Intelligence, Apr-8-2023

Named Entity Recognition (NER) plays a vital role in various Natural Language Processing tasks such as information retrieval, text classification, and question answering. However, NER can be challenging, especially in low-resource languages with limited annotated datasets and tools. This paper adds to the effort of addressing these challenges by introducing MphayaNER, the first Tshivenda NER corpus in the news domain. We establish NER baselines by fine-tuning state-of-the-art models on MphayaNER. The study also explores zero-shot transfer between Tshivenda and other related Bantu languages, with chiShona and Kiswahili showing the best results. Augmenting MphayaNER with chiShona data was also found to improve model performance significantly. Both MphayaNER and the baseline models are made publicly available.

information retrieval, mphayaner, natural language, (16 more...)

2304.03952

Country:

Africa > South Africa > Gauteng (0.15)
Asia > Middle East > UAE (0.14)

Genre: Research Report > Promising Solution (0.34)

Industry: Government (0.47)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)

Ai APologised to Me? – The Net & U

#artificialintelligence, Jan-9-2023, 16:50:20 GMT

I had the opportunity to engage in a conversation with an artificial intelligence (AI) program recently. It was fascinating to see how the conversation progressed and how the AI responded to my input. Are you curious about how our conversation went? How can I help you today? AI: Yes, I am familiar with Kiswahili. It is a Bantu language that is spoken primarily in East Africa, particularly in Tanzania, Kenya, Uganda, and the Democratic Republic of Congo.

artificial intelligence, kenya government, uganda government, (6 more...)

Country:

Africa > Uganda (0.27)
Africa > Tanzania (0.27)
Africa > Kenya (0.27)
(2 more...)

Technology: Information Technology > Artificial Intelligence (1.00)

Phonemic Representation and Transcription for Speech to Text Applications for Under-resourced Indigenous African Languages: The Case of Kiswahili

Awino, Ebbie, Wanzare, Lilian, Muchemi, Lawrence, Wanjawa, Barack, Ombui, Edward, Indede, Florence, McOnyango, Owen, Okal, Benard

arXiv.org Artificial Intelligence, Oct-29-2022

Building automatic speech recognition (ASR) systems is a challenging task, especially for under-resourced languages that need to construct corpora nearly from scratch and lack sufficient training data. It has emerged that several African indigenous languages, including Kiswahili, are technologically under-resourced. ASR systems are crucial, particularly for hearing-impaired persons who can benefit from having transcripts in their native languages. However, the absence of transcribed speech datasets has complicated efforts to develop ASR models for these indigenous languages. This paper explores the transcription process and the development of a Kiswahili speech corpus, which includes both read-out texts and spontaneous speech data from native Kiswahili speakers. The study also discusses the vowels and consonants in Kiswahili and provides an updated Kiswahili phoneme dictionary for the ASR model that was created using the CMU Sphinx speech recognition toolbox, an open-source speech recognition toolkit. The ASR model was trained using an extended phonetic set that yielded a WER and SER of 18.87% and 49.5%, respectively, an improvement over previous similar research for under-resourced languages.
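For readers unfamiliar with the two metrics, the sketch below shows how WER and SER are conventionally computed. It assumes SER means sentence error rate (the share of utterances with at least one error) and uses whitespace tokenization; the function names and the two example utterances are hypothetical and are not taken from the paper or from CMU Sphinx.

```python
from typing import List, Sequence

def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance counting substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def wer(refs: List[str], hyps: List[str]) -> float:
    """Word error rate: total word-level edits over total reference words."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in zip(refs, hyps))
    words = sum(len(r.split()) for r in refs)
    return errors / words

def ser(refs: List[str], hyps: List[str]) -> float:
    """Sentence error rate: fraction of utterances not transcribed exactly."""
    wrong = sum(r.split() != h.split() for r, h in zip(refs, hyps))
    return wrong / len(refs)

# Hypothetical reference transcripts and recognizer outputs.
refs = ["habari ya asubuhi", "asante sana"]
hyps = ["habari za asubuhi", "asante sana"]
print(f"WER={wer(refs, hyps):.2f} SER={ser(refs, hyps):.2f}")
```

Because a single wrong word makes a whole sentence count as an error, SER is typically much higher than WER on the same output, which is consistent with the gap between the two figures reported above.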

artificial intelligence, machine learning, transcription, (18 more...)

2210.16537

Country:

Africa > Tanzania > Dar es Salaam Region > Dar es Salaam (0.04)
Africa > Southern Africa (0.04)
Africa > Kenya > Nairobi City County > Nairobi (0.04)
(14 more...)

Genre: Research Report > New Finding (0.46)

Industry:

Media (0.46)
Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

When Is TTS Augmentation Through a Pivot Language Useful?

Robinson, Nathaniel, Ogayo, Perez, Gangu, Swetha, Mortensen, David R., Watanabe, Shinji

arXiv.org Artificial Intelligence, Jul-20-2022

Developing Automatic Speech Recognition (ASR) for low-resource languages is a challenge due to the small amount of transcribed audio data. For many such languages, audio and text are available separately, but not audio with transcriptions. Using text, speech can be synthetically produced via text-to-speech (TTS) systems. However, many low-resource languages do not have quality TTS systems either. We propose an alternative: produce synthetic audio by running text from the target language through a trained TTS system for a higher-resource pivot language. We investigate when and how this technique is most effective in low-resource settings. In our experiments, using several thousand synthetic TTS text-speech pairs and duplicating authentic data to balance yields optimal results. Our findings suggest that searching over a set of candidate pivot languages can lead to marginal improvements and that, surprisingly, ASR performance can be harmed by increases in measured TTS quality. Application of these findings improves ASR by 64.5% and 45.0% character error reduction rate (CERR) respectively for two low-resource languages: Guaraní and Suba.
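The abstract does not define CERR. A common reading, assumed here, is the relative reduction in character error rate (CER) against a baseline system. The sketch below illustrates that reading only; the function name `cerr` and the input values are hypothetical and are not results from the paper.

```python
def cerr(baseline_cer: float, new_cer: float) -> float:
    """Relative character error reduction, assuming CERR = (base - new) / base."""
    if baseline_cer <= 0:
        raise ValueError("baseline CER must be positive")
    return (baseline_cer - new_cer) / baseline_cer

# Hypothetical CER values chosen only to show the arithmetic.
reduction = cerr(baseline_cer=0.40, new_cer=0.30)
print(f"CERR = {reduction:.1%}")
```

Under this reading, a CERR figure describes how much of the baseline's error was removed, not the absolute error rate of the improved system.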

experiment, kiswahili, pivot language, (15 more...)

2207.09889

Country:

North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.14)
Europe > Finland > Uusimaa > Helsinki (0.05)
South America > Paraguay (0.04)
(4 more...)

Genre: Research Report > New Finding (1.00)

Technology: Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)

Google releases TyDi QA, a data set that aims to capture the uniqueness of languages

#artificialintelligence, Feb-9-2020, 05:49:20 GMT

Google hopes to spur the development of AI capable of understanding the ways in which languages express different meanings. To this end, company researchers today detailed a data set (TyDi QA, a question-answering data set covering 11 languages) inspired by typological diversity, or the notion that different languages express meaning in structurally unique ways. TyDi QA is something of a complement to the English-language Natural Questions corpus Google released last year, and it attempts to capture the idiosyncrasies and features of tongues like Japanese and Arabic. The researchers point out, for instance, that English changes words to indicate one object ("book") versus many ("books"), and that Arabic has a third form to indicate if there are two of something ("كتابان", kitaban) beyond just singular ("كتاب", kitab) or plural ("كتب", kutub). "Because we selected a set of languages that are typologically distant from each other for this corpus, we expect models performing well on this dataset to generalize across a large number of the languages in the world," wrote Google Research scientist Jonathan Clark in a blog post.

arabic, google release tydi qa, tydi qa, (8 more...)

Technology:

Information Technology > Communications > Social Media (0.59)
Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.39)
