AITopics | isixhosa

isixhosa

The Learning Dynamics of Subword Segmentation for Morphologically Diverse Languages

Meyer, Francois, Buys, Jan

arXiv.org Artificial Intelligence | Nov-20-2025

Subword segmentation is typically applied in preprocessing and stays fixed during training. Alternatively, it can be learned during training to optimise the training objective. In this paper we study the learning dynamics of subword segmentation: if a language model can dynamically optimise tokenisation, how do its subwords evolve during pretraining and finetuning? To explore this, we extend the subword segmental language model (SSLM), a framework for learning subwords during training, to support pretraining and finetuning. We train models for three typologically diverse languages to study learning dynamics across the morphological spectrum: isiXhosa is conjunctive (long word forms composed of many morphemes), Setswana is disjunctive (morphemes written as separate words), and English represents a typological middle ground. We analyse subword dynamics from a linguistic perspective, tracking morphology, productivity, and fertility. We identify four stages of subword learning, with the morphologically complex isiXhosa exhibiting greater instability. During finetuning, subword boundaries shift to become finer-grained. Lastly, we show that learnable subwords offer a promising approach to improve text generation and cross-lingual transfer for low-resource, morphologically complex languages.
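As a rough illustration of the fertility measure mentioned in the abstract, the sketch below computes the average number of subwords per word. This is the common definition of fertility and is assumed here; the paper may define or aggregate it differently. The function name `fertility` and the example segmentations are hypothetical and are not taken from the paper.

```python
def fertility(segmented_words):
    """Average number of subwords per word (assumed definition of fertility).

    segmented_words: list of words, each given as a list of subword strings.
    """
    if not segmented_words:
        raise ValueError("need at least one segmented word")
    total_subwords = sum(len(subwords) for subwords in segmented_words)
    return total_subwords / len(segmented_words)


# Hypothetical, hand-made segmentations for illustration only.
# Conjunctive orthography: few long words, each split into several subwords.
conjunctive_example = [["ndi", "ya", "ku", "thanda"], ["aba", "ntwana"]]
# Disjunctive orthography: many short words, each kept as a single subword.
disjunctive_example = [["ke"], ["a"], ["go"], ["rata"]]

print(fertility(conjunctive_example))  # 3.0
print(fertility(disjunctive_example))  # 1.0
```

Under this definition, a higher value means that words are split into more pieces on average.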

computational linguistic, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2511.09197

Country:

Europe (1.00)
Asia (1.00)
Africa (1.00)
North America > United States > Minnesota (0.28)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.47)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.47)

Automatically assessing oral narratives of Afrikaans and isiXhosa children

Louw, Retief, Sharratt, Emma, de Wet, Febe, Jacobs, Christiaan, Smith, Annelien, Kamper, Herman

arXiv.org Artificial Intelligence | Jul-21-2025

Developing narrative and comprehension skills in early childhood is critical for later literacy. However, teachers in large preschool classrooms struggle to accurately identify students who require intervention. We present a system for automatically assessing oral narratives of preschool children in Afrikaans and isiXhosa. The system uses automatic speech recognition followed by a machine learning scoring model to predict narrative and comprehension scores. For scoring predicted transcripts, we compare a linear model to a large language model (LLM). The LLM-based system outperforms the linear model in most cases, but the linear system is competitive despite its simplicity. The LLM-based system is comparable to a human expert in flagging children who require intervention. We lay the foundation for automatic oral assessments in classrooms, giving teachers extra capacity to focus on personalised support for children's learning.

large language model, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2507.13205

Country: Africa > South Africa (0.15)

Genre: Research Report > New Finding (0.47)

Industry:

Education > Educational Setting (0.68)
Education > Assessment & Standards (0.47)

Technology:

Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.46)

Feature-based analysis of oral narratives from Afrikaans and isiXhosa children

Sharratt, Emma, Smith, Annelien, Louw, Retief, Klop, Daleen, de Wet, Febe, Kamper, Herman

arXiv.org Artificial Intelligence | Jul-21-2025

Oral narrative skills are strong predictors of later literacy development. This study examines the features of oral narratives from children who were identified by experts as requiring intervention. Using simple machine learning methods, we analyse recorded stories from four- and five-year-old Afrikaans- and isiXhosa-speaking children. Consistent with prior research, we identify lexical diversity (unique words) and length-based features (mean utterance length) as indicators of typical development, but features like articulation rate prove less informative. Despite cross-linguistic variation in part-of-speech patterns, the use of specific verbs and auxiliaries associated with goal-directed storytelling is correlated with a reduced likelihood of requiring intervention. Our analysis of two linguistically distinct languages reveals both language-specific and shared predictors of narrative proficiency, with implications for early assessment in multilingual contexts.
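The two feature families named in the abstract, lexical diversity (unique words) and mean utterance length, can be sketched as follows. This is a minimal illustration that assumes whitespace tokenisation and one string per utterance; the function name `narrative_features` and the sample utterances are hypothetical, and the study's actual feature extraction may differ.

```python
def narrative_features(utterances):
    """Compute two simple features from a transcribed narrative.

    utterances: list of strings, one per utterance.
    Tokenisation is plain lowercased whitespace splitting (an assumption).
    """
    tokenised = [u.lower().split() for u in utterances if u.strip()]
    if not tokenised:
        raise ValueError("need at least one non-empty utterance")
    all_words = [word for utterance in tokenised for word in utterance]
    return {
        # Lexical diversity as the count of distinct word forms.
        "unique_words": len(set(all_words)),
        # Mean utterance length in words.
        "mean_utterance_length": len(all_words) / len(tokenised),
    }


# Hypothetical two-utterance story.
features = narrative_features(["the dog ran", "the dog saw a cat"])
print(features)  # {'unique_words': 6, 'mean_utterance_length': 4.0}
```

Features of this kind could then be passed to a simple classifier or regression model, in line with the "simple machine learning methods" the abstract describes.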

intervention, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2507.13164

Country: Africa > South Africa (0.15)

Genre: Research Report > Experimental Study (0.89)

Industry: Education > Educational Setting > K-12 Education > Primary School (0.66)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

The Esethu Framework: Reimagining Sustainable Dataset Governance and Curation for Low-Resource Languages

Rajab, Jenalea, Aremu, Anuoluwapo, Chimoto, Everlyn Asiko, Dunbar, Dale, Morrissey, Graham, Thior, Fadel, Potgieter, Luandrie, Ojo, Jessico, Tonja, Atnafu Lambebo, Chetty, Maushami, Nekoto, Onyothi, Moiloa, Pelonomi, Abbott, Jade, Marivate, Vukosi, Rosman, Benjamin

arXiv.org Artificial Intelligence | Feb-21-2025

This paper presents the Esethu Framework, a sustainable data curation framework specifically designed to empower local communities and ensure equitable benefit-sharing from their linguistic resources. This framework is supported by the Esethu license, a novel community-centric data license. As a proof of concept, we introduce the Vuk'uzenzele isiXhosa Speech Dataset (ViXSD), an open-source corpus developed under the Esethu Framework and License. The dataset, containing read speech from native isiXhosa speakers enriched with demographic and linguistic metadata, demonstrates how community-driven licensing and curation principles can bridge resource gaps in automatic speech recognition (ASR) for African languages while safeguarding the interests of data creators. We describe the framework guiding dataset development, outline the Esethu license provisions, present the methodology for ViXSD, and present ASR experiments validating ViXSD's usability in building and refining voice-driven applications for isiXhosa.

african language, dataset, license, (11 more...)

arXiv.org Artificial Intelligence

2502.15916

Country:

North America > United States (0.04)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)
(5 more...)

Genre: Research Report (1.00)

Industry:

Information Technology (0.46)
Government > Regional Government > Africa Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)

Speech Recognition for Automatically Assessing Afrikaans and isiXhosa Preschool Oral Narratives

Jacobs, Christiaan, Smith, Annelien, Klop, Daleen, Klejch, Ondřej, de Wet, Febe, Kamper, Herman

arXiv.org Artificial Intelligence | Jan-11-2025

We develop automatic speech recognition (ASR) systems for stories told by Afrikaans and isiXhosa preschool children. Oral narratives provide a way to assess children's language development before they learn to read. We consider a range of prior child-speech ASR strategies to determine which is best suited to this unique setting. Using Whisper and only 5 minutes of transcribed in-domain child speech, we find that additional in-domain adult data (adult speech matching the story domain) provides the biggest improvement, especially when coupled with voice conversion. Semi-supervised learning also helps for both languages, while parameter-efficient fine-tuning helps on Afrikaans but not on isiXhosa (which is under-represented in the Whisper model). Few child-speech studies look at non-English data, and even fewer at the preschool ages of 4 and 5. Our work therefore represents a unique validation of a wide range of previous child-speech ASR strategies in an under-explored setting.

artificial intelligence, machine learning, speech, (15 more...)

arXiv.org Artificial Intelligence

2501.06478

Country: Africa > South Africa (0.14)

Genre: Research Report (1.00)

Industry: Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

BabyLMs for isiXhosa: Data-Efficient Language Modelling in a Low-Resource Context

Matzopoulos, Alexis, Hendriks, Charl, Mahomed, Hishaam, Meyer, Francois

arXiv.org Artificial Intelligence | Jan-7-2025

The BabyLM challenge called on participants to develop sample-efficient language models. Submissions were pretrained on a fixed English corpus, limited to the number of words children are exposed to in development (<100m). The challenge produced new architectures for data-efficient language modelling, which outperformed models trained on trillions of words. This is promising for low-resource languages, where available corpora are limited to much less than 100m words. In this paper, we explore the potential of BabyLMs for low-resource languages, using the isiXhosa language as a case study. We pretrain two BabyLM architectures, ELC-BERT and MLSM, on an isiXhosa corpus. They outperform a vanilla pretrained model on POS tagging and NER, achieving notable gains (+3.2 F1) for the latter. In some instances, the BabyLMs even outperform XLM-R. Our findings show that data-efficient models are viable for low-resource languages, but highlight the continued importance of, and the lack of, high-quality pretraining data. Finally, we visually analyse how BabyLM architectures encode isiXhosa.

computational linguistic, large language model, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2501.03855

Country:

Asia > Middle East (0.28)
North America > United States > Minnesota (0.28)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.72)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.46)

Triples-to-isiXhosa (T2X): Addressing the Challenges of Low-Resource Agglutinative Data-to-Text Generation

Meyer, Francois, Buys, Jan

arXiv.org Artificial Intelligence | Mar-12-2024

Most data-to-text datasets are for English, so the difficulties of modelling data-to-text for low-resource languages are largely unexplored. In this paper we tackle data-to-text for isiXhosa, which is low-resource and agglutinative. We introduce Triples-to-isiXhosa (T2X), a new dataset based on a subset of WebNLG, which presents a new linguistic context that shifts modelling demands to subword-driven techniques. We also develop an evaluation framework for T2X that measures how accurately generated text describes the data. This enables future users of T2X to go beyond surface-level metrics in evaluation. On the modelling side we explore two classes of methods - dedicated data-to-text models trained from scratch and pretrained language models (PLMs). We propose a new dedicated architecture aimed at agglutinative data-to-text, the Subword Segmental Pointer Generator (SSPG). It jointly learns to segment words and copy entities, and outperforms existing dedicated models for 2 agglutinative languages (isiXhosa and Finnish). We investigate pretrained solutions for T2X, which reveals that standard PLMs come up short. Fine-tuning machine translation models emerges as the best method overall. These findings underscore the distinct challenge presented by T2X: neither well-established data-to-text architectures nor customary pretrained methodologies prove optimal. We conclude with a qualitative analysis of generation errors and an ablation study.

computational linguistic, dataset, proceedings, (16 more...)

arXiv.org Artificial Intelligence

2403.07567

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > California > San Francisco County > San Francisco (0.14)
Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
(24 more...)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)

Subword Segmental Language Modelling for Nguni Languages

Meyer, Francois, Buys, Jan

arXiv.org Artificial Intelligence | Oct-12-2022

Subwords have become the standard units of text in NLP, enabling efficient open-vocabulary models. With algorithms like byte-pair encoding (BPE), subword segmentation is viewed as a preprocessing step applied to the corpus before training. This can lead to sub-optimal segmentations for low-resource languages with complex morphologies. We propose a subword segmental language model (SSLM) that learns how to segment words while being trained for autoregressive language modelling. By unifying subword segmentation and language modelling, our model learns subwords that optimise LM performance. We train our model on the 4 Nguni languages of South Africa. These are low-resource agglutinative languages, so subword information is critical. As an LM, SSLM outperforms existing approaches such as BPE-based models on average across the 4 languages. Furthermore, it outperforms standard subword segmenters on unsupervised morphological segmentation. We also train our model as a word-level sequence model, resulting in an unsupervised morphological segmenter that outperforms existing methods by a large margin for all 4 languages. Our results show that learning subword segmentation is an effective alternative to existing subword segmenters, enabling the model to discover morpheme-like subwords that improve its LM capabilities.
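To give a feel for what treating segmentation as something learned alongside the language model can involve, the sketch below sums the probability of a word over all of its possible segmentations with a forward dynamic program. It is a toy under strong assumptions: subword probabilities come from a fixed, hypothetical unigram table (`subword_logp`), whereas the SSLM described in the abstract is a neural model whose details are not given here. The function name `log_marginal` and the `max_len` parameter are likewise illustrative.

```python
import math


def log_marginal(word, subword_logp, max_len=4):
    """Log-probability of `word`, summed over all segmentations into subwords.

    subword_logp: dict mapping subword string -> log-probability
                  (a toy unigram table, assumed for illustration).
    max_len: longest subword considered.
    """
    # alpha[i] holds the log of the total probability of all
    # segmentations of the prefix word[:i].
    alpha = [-math.inf] * (len(word) + 1)
    alpha[0] = 0.0
    for end in range(1, len(word) + 1):
        scores = []
        for start in range(max(0, end - max_len), end):
            piece = word[start:end]
            if piece in subword_logp and alpha[start] > -math.inf:
                scores.append(alpha[start] + subword_logp[piece])
        if scores:
            # Log-sum-exp over every way of ending a subword at `end`.
            top = max(scores)
            alpha[end] = top + math.log(sum(math.exp(s - top) for s in scores))
    return alpha[len(word)]


# Toy table: "ab" can be read as "a" + "b" (0.5 * 0.25) or as "ab" (0.25).
toy_logp = {"a": math.log(0.5), "b": math.log(0.25), "ab": math.log(0.25)}
print(round(math.exp(log_marginal("ab", toy_logp)), 6))  # 0.375
```

Because the marginal is a sum over segmentations rather than a single fixed split, a model trained on such an objective can shift probability between segmentations as training proceeds.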

machine learning, natural language, segmentation, (20 more...)

arXiv.org Artificial Intelligence

2210.06525

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Africa > South Africa > Western Cape > Cape Town (0.04)
North America > United States > California > San Diego County > San Diego (0.04)
(13 more...)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.32)

News & Events - Opinion: African 'clicks' outwit artificial intelligence

#artificialintelligence | May-14-2018, 08:56:10 GMT

Taking African languages into the digital and the fourth industrial ages is our responsibility. We cannot just import technology, such as speech recognition machines, but we should adjust them to our particular environments, writes Professor Tshilidzi Marwala. The Vice-Chancellor and Principal of the University of Johannesburg (UJ), as well as the author of the book Artificial Intelligence for Rational Decision Making, Prof Marwala recently penned an opinion piece, "African 'clicks' outwit artificial intelligence", published by the Sunday Independent on 13 May 2018. IsiXhosa is an interesting language that has over 9 million speakers. It is a language often associated with clicks.

artificial intelligence, isixhosa, machine learning, (14 more...)

#artificialintelligence

Country: Africa > South Africa > Gauteng > Johannesburg (0.27)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.34)

African 'clicks' outwit artificial intelligence | Sunday Independent

#artificialintelligence | May-13-2018, 12:05:44 GMT

IsiXhosa is an interesting language that has over 9 million speakers. It is a language often associated with clicks. Our famous musician, the late Mama Africa, Miriam Makeba, made isiXhosa famous by introducing the Click Song, also called Qongqothwane, to the world. Despite the stereotype, isiXhosa is not a clicking language but a Bantu language. Joseph Greenberg, the US linguist, classified African languages into four stocks, one of which is the Bantu language that is spoken from Tanzania to South Africa.

artificial intelligence, isixhosa, machine learning, (12 more...)

#artificialintelligence

Country:

Africa > South Africa (0.39)
Africa > Tanzania (0.25)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.34)
