AITopics | setswana

The Learning Dynamics of Subword Segmentation for Morphologically Diverse Languages

Meyer, Francois, Buys, Jan

arXiv.org Artificial Intelligence, Nov-20-2025

Subword segmentation is typically applied in preprocessing and stays fixed during training. Alternatively, it can be learned during training to optimise the training objective. In this paper we study the learning dynamics of subword segmentation: if a language model can dynamically optimise tokenisation, how do its subwords evolve during pretraining and finetuning? To explore this, we extend the subword segmental language model (SSLM), a framework for learning subwords during training, to support pretraining and finetuning. We train models for three typologically diverse languages to study learning dynamics across the morphological spectrum: isiXhosa is conjunctive (long word forms composed of many morphemes), Setswana is disjunctive (morphemes written as separate words), and English represents a typological middle ground. We analyse subword dynamics from a linguistic perspective, tracking morphology, productivity, and fertility. We identify four stages of subword learning, with the morphologically complex isiXhosa exhibiting greater instability. During finetuning, subword boundaries shift to become finer-grained. Lastly, we show that learnable subwords offer a promising approach to improve text generation and cross-lingual transfer for low-resource, morphologically complex languages.
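Of the three quantities the abstract tracks, fertility is the simplest to illustrate. The sketch below assumes the common definition (average number of subwords per word); the abstract does not state the paper's exact formula. The function name and the example segmentations are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def fertility(segmented_words: Sequence[Sequence[str]]) -> float:
    """Average number of subwords per word (assumed definition of fertility)."""
    if not segmented_words:
        raise ValueError("need at least one segmented word")
    total_subwords = sum(len(subwords) for subwords in segmented_words)
    return total_subwords / len(segmented_words)


# Hypothetical segmentations, for illustration only.
# A conjunctive orthography packs several morphemes into one written word,
# while a disjunctive orthography writes them as separate words.
conjunctive_example = [["ndi", "ya", "hamba"]]       # one word, three subwords
disjunctive_example = [["ke"], ["a"], ["tsamaya"]]   # three words, one subword each

print(fertility(conjunctive_example))
print(fertility(disjunctive_example))
```

Under this definition, a segmentation that becomes finer-grained during finetuning would show up as rising fertility on the same text.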

computational linguistic, machine learning, natural language, (19 more...)

arXiv:2511.09197

Country:

Europe (1.00)
Asia (1.00)
Africa (1.00)
North America > United States > Minnesota (0.28)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.47)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.47)

Bridging the Gap: Enhancing LLM Performance for Low-Resource African Languages with New Benchmarks, Fine-Tuning, and Cultural Adjustments

Alhanai, Tuka, Kasumovic, Adam, Ghassemi, Mohammad, Zitzelberger, Aven, Lundin, Jessica, Chabot-Couture, Guillaume

arXiv.org Artificial Intelligence, Dec-16-2024

Large Language Models (LLMs) have shown remarkable performance across various tasks, yet significant disparities remain for non-English languages, and especially native African languages. This paper addresses these disparities by creating approximately 1 million human-translated words of new benchmark data in 8 low-resource African languages, covering a population of over 160 million speakers of: Amharic, Bambara, Igbo, Sepedi (Northern Sotho), Shona, Sesotho (Southern Sotho), Setswana, and Tsonga. Our benchmarks are translations of Winogrande and three sections of MMLU: college medicine, clinical knowledge, and virology. Using the translated benchmarks, we report previously unknown performance gaps between state-of-the-art (SOTA) LLMs in English and African languages. Finally, using results from over 400 fine-tuned models, we explore several methods to reduce the LLM performance gap, including high-quality dataset fine-tuning (using an LLM-as-an-Annotator), cross-lingual transfer, and cultural appropriateness adjustments. Key findings include average mono-lingual improvements of 5.6% with fine-tuning (with 5.4% average mono-lingual improvements when using high-quality data over low-quality data), 2.9% average gains from cross-lingual transfer, and a 3.0% out-of-the-box performance boost on culturally appropriate questions. The publicly available benchmarks, translations, and code from this study support further research and development aimed at creating more inclusive and effective language technologies.

large language model, machine learning, translation, (19 more...)

arXiv:2412.12417

Country:

North America > United States (0.04)
Africa > Niger (0.04)
Europe > Croatia > Dubrovnik-Neretva County > Dubrovnik (0.04)
(3 more...)

Genre: Research Report > New Finding (1.00)

Industry:

Education (0.93)
Health & Medicine (0.88)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.50)

BOTS-LM: Training Large Language Models for Setswana

Brown, Nathan, Marivate, Vukosi

arXiv.org Artificial Intelligence, Aug-5-2024

In this work we present BOTS-LM, a series of bilingual language models proficient in both Setswana and English. Leveraging recent advancements in data availability and efficient fine-tuning, BOTS-LM achieves performance similar to models significantly larger than itself while maintaining computational efficiency. Our initial release features an 8 billion parameter generative large language model, with upcoming 0.5 billion and 1 billion parameter large language models and a 278 million parameter encoder-only model soon to be released. We find the 8 billion parameter model significantly outperforms Llama-3-70B and Aya 23 on English-Setswana translation tasks, approaching the performance of dedicated machine translation models, while approaching 70B parameter performance on Setswana reasoning as measured by a machine translated subset of the MMLU benchmark. To accompany the BOTS-LM series of language models, we release the largest Setswana web dataset, SetsText, totalling over 267 million tokens. In addition, we release the largest machine translated Setswana dataset, the first and largest synthetic Setswana dataset, training and evaluation code, training logs, and MMLU-tsn, a machine translated subset of MMLU.

setswana, wang, zhang, (16 more...)

arXiv:2408.02239

Country:

Africa > Botswana (0.28)
Africa > Zimbabwe (0.28)
Europe > Portugal > Lisbon > Lisbon (0.14)
(23 more...)

Genre: Research Report (0.50)

Industry: Government > Regional Government > Africa Government (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Predicting positive transfer for improved low-resource speech recognition using acoustic pseudo-tokens

San, Nay, Paraskevopoulos, Georgios, Arora, Aryaman, He, Xiluo, Kaur, Prabhjot, Adams, Oliver, Jurafsky, Dan

arXiv.org Artificial Intelligence, Feb-3-2024

While massively multilingual speech models like wav2vec 2.0 XLSR-128 can be directly fine-tuned for automatic speech recognition (ASR), downstream performance can still be relatively poor on languages that are under-represented in the pre-training data. Continued pre-training on 70-200 hours of untranscribed speech in these languages can help -- but what about languages without that much recorded data? For such cases, we show that supplementing the target language with data from a similar, higher-resource 'donor' language can help. For example, continued pre-training on only 10 hours of low-resource Punjabi supplemented with 60 hours of donor Hindi is almost as good as continued pre-training on 70 hours of Punjabi. By contrast, sourcing data from less similar donors like Bengali does not improve ASR performance. To inform donor language selection, we propose a novel similarity metric based on the sequence distribution of induced acoustic units: the Acoustic Token Distribution Similarity (ATDS). Across a set of typologically different target languages (Punjabi, Galician, Iban, Setswana), we show that the ATDS between the target language and its candidate donors precisely predicts target language ASR performance.
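The abstract describes ATDS only as a similarity over the sequence distribution of induced acoustic units. As a rough illustration of that idea, the sketch below counts n-grams of pseudo-token IDs per language and compares the two count vectors with cosine similarity. Both choices (n-gram counting and cosine similarity), the function names, and the toy token IDs are assumptions made here; this is not a reproduction of the paper's procedure.

```python
from collections import Counter
from math import sqrt
from typing import Iterable, Sequence


def ngram_counts(utterances: Iterable[Sequence[int]], n: int = 2) -> Counter:
    """Count n-grams of acoustic pseudo-token IDs over all utterances."""
    counts: Counter = Counter()
    for tokens in utterances:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy pseudo-token sequences (hypothetical IDs, not real model output).
target = [[1, 2, 3, 1, 2]]
similar_donor = [[1, 2, 3, 1, 2, 3]]
dissimilar_donor = [[7, 8, 9]]

sim_close = cosine_similarity(ngram_counts(target), ngram_counts(similar_donor))
sim_far = cosine_similarity(ngram_counts(target), ngram_counts(dissimilar_donor))
print(sim_close > sim_far)
```

In this toy setup the donor that shares token sequences with the target scores higher, which mirrors the donor-selection use the abstract describes.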

asr performance, punjabi, target language, (13 more...)

arXiv:2402.02302

Country:

Africa > South Africa (0.04)
Europe > Spain (0.04)
Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)
(21 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

PuoBERTa: Training and evaluation of a curated language model for Setswana

Marivate, Vukosi, Mots'Oehli, Moseli, Wagner, Valencia, Lastrucci, Richard, Dzingirai, Isheanesu

arXiv.org Artificial Intelligence, Oct-24-2023

Natural language processing (NLP) has made significant progress for well-resourced languages such as English but lagged behind for low-resource languages like Setswana. This paper addresses this gap by presenting PuoBERTa, a customised masked language model trained specifically for Setswana. We cover how we collected, curated, and prepared diverse monolingual texts to generate a high-quality corpus for PuoBERTa's training. Building upon previous efforts in creating monolingual resources for Setswana, we evaluated PuoBERTa across several NLP tasks, including part-of-speech (POS) tagging, named entity recognition (NER), and news categorisation. Additionally, we introduced a new Setswana news categorisation dataset and provided the initial benchmarks using PuoBERTa. Our work demonstrates the efficacy of PuoBERTa in fostering NLP capabilities for understudied languages like Setswana and paves the way for future research directions.

language model, puoberta, setswana, (16 more...)

arXiv:2310.09141

Country:

Africa > Botswana (0.29)
Europe > Germany > Saxony > Leipzig (0.05)
North America > United States > Hawaii (0.04)
(10 more...)

Genre:

Research Report (1.00)
Overview (0.68)

Industry: Government > Regional Government > Africa Government (0.46)

Technology: Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.87)

Benchmarking Neural Machine Translation for Southern African Languages

Martinus, Laura, Abbott, Jade Z.

arXiv.org Machine Learning, Jun-17-2019

Unlike major Western languages, most African languages are very low-resourced. Furthermore, the resources that do exist are often scattered and difficult to obtain and discover. As a result, the data and code for existing research have rarely been shared. This has led to a struggle to reproduce reported results, and few publicly available benchmarks for African machine translation models exist. To start to address these problems, we trained neural machine translation models for 5 Southern African languages on publicly-available datasets. Code is provided for training the models and evaluating them on a newly released evaluation set, with the aim of spurring future research in the field for Southern African languages.

artificial intelligence, machine translation, natural language, (14 more...)

arXiv:1906.10511

Country: Africa (0.22)

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

A Focus on Neural Machine Translation for African Languages

Martinus, Laura, Abbott, Jade Z.

arXiv.org Machine Learning, Jun-14-2019

African languages are numerous, complex and low-resourced. The datasets required for machine translation are difficult to discover, and existing research is hard to reproduce. Minimal attention has been given to machine translation for African languages so there is scant research regarding the problems that arise when using machine translation techniques. To begin addressing these problems, we trained models to translate English to five of the official South African languages (Afrikaans, isiZulu, Northern Sotho, Setswana, Xitsonga), making use of modern neural machine translation techniques. The results obtained show the promise of using neural machine translation techniques for African languages. By providing reproducible publicly-available data, code and results, this research aims to provide a starting point for other researchers in African machine translation to compare to and build upon.

artificial intelligence, natural language, translation, (17 more...)

arXiv:1906.05685

Country:

Africa > South Africa > Gauteng > Johannesburg (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Africa > Southern Africa (0.04)
Africa > South Africa > Western Cape > Cape Town (0.04)

Genre: Research Report (0.64)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
