 South Africa Government


Mafoko: Structuring and Building Open Multilingual Terminologies for South African NLP

Marivate, Vukosi, Dzingirai, Isheanesu, Banda, Fiskani, Lastrucci, Richard, Sindane, Thapelo, Madumo, Keabetswe, Olaleye, Kayode, Modupe, Abiodun, Netshifhefhe, Unarine, Combrink, Herkulaas, Nakeng, Mohlatlego, Ledwaba, Matome

arXiv.org Artificial Intelligence

The critical lack of structured terminological data for South Africa's official languages hampers progress in multilingual NLP, despite the existence of numerous government and academic terminology lists. These valuable assets remain fragmented and locked in non-machine-readable formats, rendering them unusable for computational research and development. Mafoko addresses this challenge by systematically aggregating, cleaning, and standardising these scattered resources into open, interoperable datasets. We introduce the foundational Mafoko dataset, released under the equitable, Africa-centered NOODL framework. To demonstrate its immediate utility, we integrate the terminology into a Retrieval-Augmented Generation (RAG) pipeline. Experiments show substantial improvements in the accuracy and domain-specific consistency of English-to-Tshivenda machine translation for large language models. Mafoko provides a scalable foundation for developing robust and equitable NLP technologies, ensuring South Africa's rich linguistic diversity is represented in the digital age.
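
The terminology-augmented translation step described in this abstract could look roughly like the sketch below. It is a minimal illustration and not the authors' pipeline: the glossary contents, the `retrieve_terms` and `build_prompt` helpers, and the prompt wording are all assumptions, and the target-language strings are placeholders rather than real Tshivenda terms. The sketch retrieves glossary entries that occur in the source sentence and places them in the prompt that would be passed to a language model.

```python
import re

# Hypothetical glossary: English term -> target-language term.
# The target strings are placeholders, NOT real Tshivenda translations.
GLOSSARY = {
    "budget": "<TERM_BUDGET>",
    "local government": "<TERM_LOCAL_GOVERNMENT>",
    "election": "<TERM_ELECTION>",
}


def retrieve_terms(sentence, glossary):
    """Return (source, target) glossary entries whose source term occurs in the sentence."""
    lowered = sentence.lower()
    hits = []
    for term, target in glossary.items():
        # Case-insensitive whole-word (or whole-phrase) match.
        if re.search(r"\b" + re.escape(term) + r"\b", lowered):
            hits.append((term, target))
    return hits


def build_prompt(sentence, glossary):
    """Assemble a translation prompt that lists the retrieved terminology."""
    hits = retrieve_terms(sentence, glossary)
    lines = ["Translate the sentence from English to Tshivenda."]
    if hits:
        lines.append("Use these terminology entries:")
        lines.extend(f"- {src} -> {tgt}" for src, tgt in hits)
    lines.append(f"Sentence: {sentence}")
    return "\n".join(lines)


prompt = build_prompt("The local government approved the budget.", GLOSSARY)
print(prompt)
```

Exact string matching keeps the sketch short; a real system would likely need lemmatisation or embedding-based retrieval to catch inflected or paraphrased terms.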


South African-born Musk evoked by Trump during meeting with nation's leader: 'Don't want to get Elon involved'

FOX News

President Donald Trump invoked Elon Musk during his Oval Office meeting with South Africa's president on Wednesday, during talks about the attacks that white farmers in the country are facing. Trump went back and forth with President Cyril Ramaphosa over whether what is occurring in South Africa is indeed a "genocide" against white farmers. At one point during the conversation, a reporter asked Trump how the United States and South Africa might improve their relations. The president said that relations with South Africa are an important matter to him, noting that he has several personal friends from there, including professional golfers Ernie Els and Retief Goosen, who were present at Wednesday's meeting, and Elon Musk. Unprompted, Trump added that while Musk may be a South African native, he doesn't want to "get [him] involved" in the ongoing foreign diplomacy matters that played out during Wednesday's meeting.


CorpusBrain++: A Continual Generative Pre-Training Framework for Knowledge-Intensive Language Tasks

Guo, Jiafeng, Zhou, Changjiang, Zhang, Ruqing, Chen, Jiangui, de Rijke, Maarten, Fan, Yixing, Cheng, Xueqi

arXiv.org Artificial Intelligence

Knowledge-intensive language tasks (KILTs) typically require retrieving relevant documents from trustworthy corpora, e.g., Wikipedia, to produce specific answers. Very recently, a pre-trained generative retrieval model for KILTs, named CorpusBrain, was proposed and reached new state-of-the-art retrieval performance. However, most existing research on KILTs, including CorpusBrain, has predominantly focused on a static document collection, overlooking the dynamic nature of real-world scenarios, where new documents are continuously being incorporated into the source corpus. To address this gap, it is crucial to explore the capability of retrieval models to effectively handle the dynamic retrieval scenario inherent in KILTs. In this work, we first introduce the continual document learning (CDL) task for KILTs and build a novel benchmark dataset named KILT++ based on the original KILT dataset for evaluation. Then, we conduct a comprehensive study of the use of pre-trained CorpusBrain on KILT++. Unlike the promising results in the stationary scenario, CorpusBrain is prone to catastrophic forgetting in the dynamic scenario, which hampers retrieval performance. To alleviate this issue, we propose CorpusBrain++, a continual generative pre-training framework. Empirical results demonstrate the significant effectiveness and remarkable efficiency of CorpusBrain++ in comparison to both traditional and generative IR methods.


Preparing the Vuk'uzenzele and ZA-gov-multilingual South African multilingual corpora

Lastrucci, Richard, Dzingirai, Isheanesu, Rajab, Jenalea, Madodonga, Andani, Shingange, Matimba, Njini, Daniel, Marivate, Vukosi

arXiv.org Artificial Intelligence

This paper introduces two multilingual government-themed corpora in various South African languages. The corpora were collected by gathering the South African Government newspaper (Vuk'uzenzele), as well as South African government speeches (ZA-gov-multilingual), that are translated into all 11 South African official languages. The corpora can be used for a myriad of downstream NLP tasks. The corpora were created to allow researchers to study the language used in South African government publications, with a focus on understanding how South African government officials communicate with their constituents. In this paper we highlight the process of gathering, cleaning, and making the corpora available. We create parallel sentence corpora for Neural Machine Translation (NMT) tasks using Language-Agnostic Sentence Representations (LASER) embeddings. With these aligned sentences we then provide NMT benchmarks for 9 indigenous languages by fine-tuning a massively multilingual pre-trained language model.
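
Embedding-based sentence alignment of the kind mentioned in this abstract can be illustrated with the sketch below. The abstract does not spell out the alignment procedure, so this is an assumption-laden illustration only: the toy 3-dimensional vectors stand in for LASER sentence embeddings, and the greedy best-match rule, the `align` helper, and the 0.8 threshold are hypothetical choices.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def align(src_vecs, tgt_vecs, threshold=0.8):
    """For each source sentence, keep its most similar target if the score clears the threshold."""
    if not tgt_vecs:
        return []
    pairs = []
    for i, u in enumerate(src_vecs):
        scores = [cosine(u, v) for v in tgt_vecs]
        j = max(range(len(scores)), key=scores.__getitem__)
        if scores[j] >= threshold:
            pairs.append((i, j, scores[j]))
    return pairs


# Toy vectors standing in for sentence embeddings (assumed, not real LASER output).
src = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 1.0]]
tgt = [[0.0, 0.9, 0.1], [0.95, 0.05, 0.0], [0.0, 0.0, 1.0]]

for i, j, score in align(src, tgt):
    print(f"source {i} <-> target {j} (cosine {score:.3f})")
```

The third source vector has no sufficiently similar target and is dropped, which mirrors how a similarity threshold filters out sentences that lack a translation counterpart.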