Enhancing LLM Watermark Resilience Against Both Scrubbing and Spoofing Attacks

Shen, Huanming, Huang, Baizhou, Wan, Xiaojun

arXiv.org Artificial Intelligence

Watermarking is a promising defense against the misuse of large language models (LLMs), yet it remains vulnerable to scrubbing and spoofing attacks. This vulnerability stems from an inherent trade-off governed by watermark window size: smaller windows resist scrubbing better but are easier to reverse-engineer, enabling low-cost statistics-based spoofing attacks. This work breaks this trade-off by introducing a novel mechanism, equivalent texture keys, where multiple tokens within a watermark window can independently support detection. Based on this redundancy, we propose a novel watermark scheme with Sub-vocabulary decomposed Equivalent tExture Key (SEEK). It achieves a Pareto improvement, increasing resilience against scrubbing attacks without compromising robustness to spoofing. Experiments demonstrate SEEK's superiority over prior methods, yielding spoofing robustness gains of +88.2%/+92.3%/+82.0% and scrubbing robustness gains of +10.2%/+6.4%/+24.6% across diverse dataset settings.
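
To make the window-size trade-off concrete, here is a minimal sketch of the standard window-hash watermark detector that schemes like SEEK build on; the window size K, green-list fraction GAMMA, and hashing choice are illustrative assumptions, and SEEK's equivalent texture keys and sub-vocabulary decomposition are not reproduced here.

```python
# Minimal sketch of a window-hash watermark detector; K, GAMMA, and the
# SHA-256 keying are illustrative choices, not the paper's settings.
import hashlib
import math
import random

VOCAB_SIZE = 50_000
GAMMA = 0.5   # fraction of the vocabulary marked "green" per window
K = 2         # watermark window size: preceding tokens hashed into the key

def green_list(window: tuple[int, ...]) -> set[int]:
    """Derive the pseudo-random green list from the preceding K tokens."""
    seed = int(hashlib.sha256(str(window).encode()).hexdigest(), 16) % 2**32
    rng = random.Random(seed)
    return set(rng.sample(range(VOCAB_SIZE), int(GAMMA * VOCAB_SIZE)))

def detect_z_score(tokens: list[int]) -> float:
    """One-proportion z-test: how far the green-token count sits above chance."""
    hits, total = 0, 0
    for i in range(K, len(tokens)):
        window = tuple(tokens[i - K:i])
        if tokens[i] in green_list(window):
            hits += 1
        total += 1
    if total == 0:
        return 0.0
    return (hits - GAMMA * total) / math.sqrt(GAMMA * (1 - GAMMA) * total)
```

The trade-off falls out of K: with a small window each green list can be re-estimated from token statistics (enabling spoofing), while with a large window a single edited token invalidates every window containing it (aiding scrubbing).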


Mechanistic Interpretability with SAEs: Probing Religion, Violence, and Geography in Large Language Models

Simbeck, Katharina, Mahran, Mariam

arXiv.org Artificial Intelligence

Despite growing research on bias in large language models (LLMs), most work has focused on gender and race, with little attention to religious identity. This paper explores how religion is internally represented in LLMs and how it intersects with concepts of violence and geography. Using mechanistic interpretability and Sparse Autoencoders (SAEs) via the Neuronpedia API, we analyze latent feature activations across five models. We measure overlap between religion- and violence-related prompts and probe semantic patterns in activation contexts. While all five religions show comparable internal cohesion, Islam is more frequently linked to features associated with violent language. In contrast, geographic associations largely reflect real-world religious demographics, revealing how models embed both factual distributions and cultural stereotypes. These findings highlight the value of structural analysis in auditing not just outputs but also internal representations that shape model behavior.
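
As a rough illustration of the overlap measurement, the sketch below pools the top SAE latents fired by two prompt groups and compares them with a Jaccard score; `top_features` is a self-contained placeholder for the actual Neuronpedia lookup, and the Jaccard metric is an assumption rather than the paper's exact procedure.

```python
# Sketch of group-level SAE feature overlap. `top_features` is a
# deterministic fake standing in for a real SAE/Neuronpedia feature lookup.
import random

def top_features(prompt: str, k: int = 20) -> set[int]:
    """Placeholder: indices of the k most active SAE latents for `prompt`."""
    rng = random.Random(prompt)          # fake but deterministic activations
    return set(rng.sample(range(16_384), k))

def jaccard(a: set[int], b: set[int]) -> float:
    """Jaccard overlap between two activated-feature sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def group_overlap(prompts_a: list[str], prompts_b: list[str]) -> float:
    """Pool the features fired by each prompt group, then compare the pools."""
    feats_a = set().union(*(top_features(p) for p in prompts_a))
    feats_b = set().union(*(top_features(p) for p in prompts_b))
    return jaccard(feats_a, feats_b)

religion = ["a verse about prayer", "a sermon on charity"]
violence = ["a report on an armed attack", "a description of a riot"]
print(f"religion/violence feature overlap: {group_overlap(religion, violence):.3f}")
```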


The State of Large Language Models for African Languages: Progress and Challenges

Hussen, Kedir Yassin, Sewunetie, Walelign Tewabe, Ayele, Abinew Ali, Imam, Sukairaj Hafiz, Muhammad, Shamsuddeen Hassan, Yimam, Seid Muhie

arXiv.org Artificial Intelligence

The rapid progress of Large Language Models (LLMs) has transformed the field of Natural Language Processing (NLP). However, these advancements have primarily concentrated on high-resource languages, leaving many low-resource languages, particularly African languages, largely overlooked. Africa has over 2,000 languages [Ethnologue, 2025], the majority of which face significant challenges such as a lack of data, limited computational resources, insufficient NLP tools, and the absence of standardized benchmarks. This study presents a three-stage review to evaluate the current status, challenges, and prospects of LLMs for African languages. The first stage investigates both commercial and open-source LLMs with more than 7 billion parameters regarding their support for African languages [Wang et al., 2024]. The second stage examines foundational multilingual models that have significantly influenced NLP research and development.


Predictors of disease outbreaks at continental scale in the African region: Insights and predictions with geospatial artificial intelligence using earth observations and routine disease surveillance data

Pezanowski, Scott, Koua, Etien Luc, Okeibunor, Joseph C, Gueye, Abdou Salam

arXiv.org Artificial Intelligence

Objectives: Our research adopts computational techniques to analyze disease outbreaks weekly over a large geographic area while maintaining local-level analysis by incorporating relevant high-spatial-resolution cultural and environmental datasets. The abundance of data about disease outbreaks gives scientists an excellent opportunity to uncover patterns in disease spread and make future predictions. However, data over a sizeable geographic area quickly outpace human cognition. Our study area covers a significant portion of the African continent (about 17,885,000 km²). The data size makes computational analysis vital to assist human decision-makers. Methods: We first applied global and local spatial autocorrelation to malaria, cholera, meningitis, and yellow fever case counts. We then used machine learning to predict the weekly presence of these diseases in each second-level administrative district. Lastly, we applied machine learning feature importance methods to identify the variables that affect spread. Results: Our spatial autocorrelation results show that geographic nearness is critical but varies in effect across space. Moreover, we identified many interesting hot and cold spots and spatial outliers. The machine learning model predicts a binary class (cases or none), with the best F1 score of 0.96 for malaria. Machine learning feature importance uncovered critical cultural and environmental factors affecting outbreaks and variations between diseases. Conclusions: Our study shows that data analytics and machine learning are vital for understanding and monitoring disease outbreaks locally across vast areas. The speed at which these methods produce insights can be critical during epidemics and emergencies.
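
A minimal sketch of the prediction-and-importance stage described in the Methods, using a random forest on synthetic district-week covariates; the feature names and data below are placeholders, not the study's actual variables.

```python
# Binary "cases vs. none" classifier per district-week, with tree-based
# feature importances. All features and labels here are synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
FEATURES = ["rainfall", "temperature", "pop_density", "dist_to_water"]  # illustrative
X = rng.normal(size=(5000, len(FEATURES)))                # district-week covariates
y = (X[:, 0] + 0.5 * X[:, 2] + rng.normal(size=5000) > 1).astype(int)   # synthetic label

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

print("F1:", round(f1_score(y_te, model.predict(X_te)), 3))
for name, imp in sorted(zip(FEATURES, model.feature_importances_),
                        key=lambda t: -t[1]):
    print(f"{name:>14}: {imp:.3f}")   # which covariates drive the prediction
```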


Few-shot Open Relation Extraction with Gaussian Prototype and Adaptive Margin

Guo, Tianlin, Zhang, Lingling, Wang, Jiaxin, Lei, Yuokuo, Li, Yifei, Wang, Haofen, Liu, Jun

arXiv.org Artificial Intelligence

Few-shot relation extraction with none-of-the-above (FsRE with NOTA) aims at predicting labels in few-shot scenarios with unknown classes. FsRE with NOTA is more challenging than the conventional few-shot relation extraction task, since the boundaries of unknown classes are complex and difficult to learn. Meta-learning based methods, especially prototype-based methods, are the mainstream solutions to this task. They obtain the classification boundary by learning the sample distribution of each class. However, their performance is limited because few-shot overfitting and NOTA boundary confusion lead to misclassification between known and unknown classes. To this end, we propose GPAM, a novel framework based on Gaussian prototypes and an adaptive margin for FsRE with NOTA, which includes three modules: semi-factual representation, GMM-prototype metric learning, and decision boundary learning. The first two modules obtain better representations to address the few-shot problem through debiased information enhancement and Gaussian-space distance measurement. The third module learns more accurate classification boundaries and prototypes through an adaptive margin and negative sampling. In the training procedure of GPAM, we use a contrastive learning loss to comprehensively consider the effects of range and margin on the classification of known and unknown classes, ensuring the model's stability and robustness. Extensive experiments and ablations on the FewRel dataset show that GPAM surpasses previous prototype methods and achieves state-of-the-art performance.
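
The sketch below illustrates the general idea of Gaussian-prototype classification with a rejection margin, under simplifying assumptions: diagonal covariances, a fixed margin instead of GPAM's learned adaptive margin, and random embeddings in place of encoder outputs.

```python
# Gaussian-prototype classification with NOTA rejection (simplified sketch).
import torch

def fit_prototypes(support: torch.Tensor):
    """support: [n_classes, k_shot, dim] -> per-class mean and (diagonal) variance."""
    mu = support.mean(dim=1)                              # [C, D]
    var = support.var(dim=1, unbiased=False) + 1e-4       # [C, D], floored
    return mu, var

def classify(query: torch.Tensor, mu, var, margin: float):
    """Mahalanobis distance to each Gaussian prototype; reject beyond margin."""
    d2 = (((query[:, None, :] - mu[None]) ** 2) / var[None]).sum(-1)  # [Q, C]
    dist, pred = d2.min(dim=1)
    pred[dist > margin] = -1   # -1 = none-of-the-above (NOTA)
    return pred

support = torch.randn(5, 4, 64)    # 5 known classes, 4 shots, 64-dim encodings
mu, var = fit_prototypes(support)
queries = torch.randn(8, 64)       # stand-ins for encoded query sentences
print(classify(queries, mu, var, margin=80.0))  # GPAM learns this margin instead
```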


Modelling and characterization of fine Particulate Matter dynamics in Bujumbura using low cost sensors

Ndamuzi, Egide, Akimana, Rachel, Gahungu, Paterne, Bimenyimana, Elie

arXiv.org Machine Learning

Air pollution results from multiple sources, including both natural and anthropogenic activities. The rapid urbanization of cities such as Bujumbura, the economic capital of Burundi, is one of these factors. This paper presents the first characterization of the spatio-temporal variability of PM2.5 in Bujumbura, together with forecasting of PM2.5 concentration, using data collected over one year, from August 2022 to August 2023, by low-cost sensors installed across Bujumbura city. For each commune, hourly, daily, and seasonal analyses were carried out, and the results showed that the mass concentrations of PM2.5 differ from one commune to another across the three municipalities. The average hourly and annual PM2.5 concentrations, ranging between 28.3 and 35.0 µg/m³, exceed the World Health Organization standards. To forecast PM2.5 concentration, a recurrent neural network (RNN) with Long Short-Term Memory (LSTM) was investigated.
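
A minimal sketch of the kind of LSTM forecaster investigated, assuming a univariate next-hour prediction task over 24-hour windows; the window length, layer sizes, and synthetic series are illustrative choices, not the paper's configuration.

```python
# Next-hour PM2.5 forecasting from the previous 24 hourly readings.
import torch
import torch.nn as nn

WINDOW = 24  # hours of history per prediction

class PM25LSTM(nn.Module):
    def __init__(self, hidden: int = 32):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                 # x: [batch, WINDOW, 1]
        out, _ = self.lstm(x)
        return self.head(out[:, -1])      # next-hour concentration

# Synthetic stand-in for a year of hourly sensor readings (daily cycle + noise).
series = 30 + 5 * torch.sin(torch.arange(8760) / 24 * 6.283) + torch.randn(8760)
X = torch.stack([series[i:i + WINDOW] for i in range(len(series) - WINDOW)])
y = series[WINDOW:]

model = PM25LSTM()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
for _ in range(3):                        # a few full-batch steps, illustration only
    pred = model(X.unsqueeze(-1)).squeeze(-1)
    loss = loss_fn(pred, y)
    opt.zero_grad(); loss.backward(); opt.step()
print("train MSE:", loss.item())
```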


Contextualising Levels of Language Resourcedness affecting Digital Processing of Text

Keet, C. Maria, Khumalo, Langa

arXiv.org Artificial Intelligence

Application domains such as digital humanities and tools like chatbots involve some form of natural language processing, from digitising hardcopies to speech generation. The language of the content is typically characterised as either a low-resource language (LRL) or a high-resource language (HRL), also known as resource-scarce and well-resourced languages, respectively. African languages have been characterized as resource-scarce languages (Bosch et al. 2007; Pretorius & Bosch 2003; Keet & Khumalo 2014), and English is by far the most well-resourced language. Varied language resources are used to develop software systems for these languages to accomplish a wide range of tasks. In this paper we argue that the dichotomous LRL/HRL typology for all languages is problematic. Through a clear understanding of language resources situated in a society, a matrix is developed that characterizes languages as Very LRL, LRL, RL, HRL, and Very HRL. The characterization is based on a typology of contextual features for each category, rather than on counting tools, and motivation is provided for each feature and each characterization. The contextualisation of resourcedness, with a focus on African languages in this paper, and an increased understanding of where on the scale the language used in a project sits, may assist in, among other things, better planning of research and implementation projects. We thus argue that the characterization of language resources within a given scale is an indispensable component of a project, particularly in the context of low-resourced languages.


Flickr Africa: Examining Geo-Diversity in Large-Scale, Human-Centric Visual Data

Naggita, Keziah, LaChance, Julienne, Xiang, Alice

arXiv.org Artificial Intelligence

Biases in large-scale image datasets are known to influence the performance of computer vision models as a function of geographic context. To investigate the limitations of standard Internet data collection methods in low- and middle-income countries, we analyze human-centric image geo-diversity on a massive scale using geotagged Flickr images associated with each nation in Africa. We report the quantity and content of available data with comparisons to population-matched nations in Europe as well as the distribution of data according to fine-grained intra-national wealth estimates. Temporal analyses are performed at two-year intervals to expose emerging data trends. Furthermore, we present findings for an "othering" phenomenon as evidenced by a substantial number of images from Africa being taken by non-local photographers. The results of our study suggest that further work is required to capture image data representative of African people and their environments and, ultimately, to improve the applicability of computer vision models in a global context.
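
The core comparisons can be pictured with a small sketch: count geotagged images per country, normalise by population, and estimate the share taken by non-local photographers as a proxy for the "othering" analysis. The column names and toy rows below are placeholders for the study's actual Flickr data.

```python
# Per-nation photo counts, per-capita normalisation, and non-local share.
import pandas as pd

photos = pd.DataFrame({           # one row per geotagged Flickr image (toy data)
    "country": ["Uganda", "Uganda", "Belgium", "Belgium", "Belgium"],
    "photographer_home": ["US", "Uganda", "Belgium", "Belgium", "FR"],
})
population = pd.Series({"Uganda": 45_850_000, "Belgium": 11_590_000})

counts = photos.groupby("country").size()
per_capita = (counts / population).rename("photos_per_capita")
nonlocal_share = (photos["country"] != photos["photographer_home"]) \
    .groupby(photos["country"]).mean().rename("nonlocal_share")  # "othering" proxy

print(pd.concat([per_capita, nonlocal_share], axis=1))
```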


Question Answering and Question Generation for Finnish

Kylliäinen, Ilmari, Yangarber, Roman

arXiv.org Artificial Intelligence

Recent advances in the field of language modeling have improved the state-of-the-art in question answering (QA) and question generation (QG). However, the development of modern neural models, their benchmarks, and datasets for training them has mainly focused on English. Finnish, like many other languages, faces a shortage of large QA/QG model training resources, which has prevented experimenting with state-of-the-art QA/QG fine-tuning methods. We present the first neural QA and QG models that work with Finnish. To train the models, we automatically translate the SQuAD dataset and then use normalization methods to reduce the amount of problematic data created during the translation. Using the synthetic data, together with the Finnish partition of the TyDi-QA dataset, we fine-tune several transformer-based models to both QA and QG and evaluate their performance. To the best of our knowledge, the resulting dataset is the first large-scale QA/QG resource for Finnish. This paper also sets the initial benchmarks for Finnish-language QA and QG.
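
One normalisation step implied above can be sketched concretely: after machine-translating a SQuAD context and its answer separately, the answer may no longer appear verbatim in the context, so each example must be realigned or dropped. The matching heuristic below is an illustrative assumption, not necessarily the authors' exact procedure.

```python
# Realign a translated SQuAD answer span inside its translated context.
def realign(context_fi: str, answer_fi: str):
    """Return (start_char, answer) if the translated answer can be anchored
    in the translated context, else None (example is discarded)."""
    idx = context_fi.find(answer_fi)
    if idx != -1:
        return idx, answer_fi
    # Fallback: case-insensitive match to tolerate casing drift in MT output.
    idx = context_fi.lower().find(answer_fi.lower())
    if idx != -1:
        return idx, context_fi[idx:idx + len(answer_fi)]
    return None  # unalignable: drop this synthetic training example

print(realign("Helsinki on Suomen pääkaupunki.", "suomen pääkaupunki"))
```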


AfroLM: A Self-Active Learning-based Multilingual Pretrained Language Model for 23 African Languages

Dossou, Bonaventure F. P., Tonja, Atnafu Lambebo, Yousuf, Oreen, Osei, Salomey, Oppong, Abigail, Shode, Iyanuoluwa, Awoyomi, Oluwabusayo Olufunke, Emezue, Chris Chinenye

arXiv.org Artificial Intelligence

In recent years, multilingual pre-trained language models have gained prominence due to their remarkable performance on numerous downstream Natural Language Processing (NLP) tasks. However, pre-training these large multilingual language models requires a lot of training data, which is not available for African languages. Active learning is a semi-supervised learning algorithm in which a model consistently and dynamically learns to identify the most beneficial samples to train itself on, in order to achieve better optimization and performance on downstream tasks. Furthermore, active learning effectively and practically addresses real-world data scarcity. Despite all its benefits, active learning, in the context of NLP and especially multilingual language model pretraining, has received little consideration. In this paper, we present AfroLM, a multilingual language model pretrained from scratch on 23 African languages (the largest effort to date) using our novel self-active learning framework. Pretrained on a dataset significantly (14x) smaller than existing baselines, AfroLM outperforms many multilingual pretrained language models (AfriBERTa, XLMR-base, mBERT) on various downstream NLP tasks (NER, text classification, and sentiment analysis). Additional out-of-domain sentiment analysis experiments show that AfroLM generalizes well across various domains. We release the source code and the datasets used in our framework at https://github.com/bonaventuredossou/MLM_AL.
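
The self-active learning loop can be sketched as follows, with `mlm_loss` and `train` as hypothetical stand-ins for the framework's actual components; selecting the highest-loss pool samples as "most beneficial" is an assumption about the acquisition criterion.

```python
# Self-active learning loop: the model scores its own unlabelled pool and
# repeatedly pulls in the hardest examples. All components are placeholders.
import random

def mlm_loss(model, example) -> float:
    """Hypothetical: masked-LM loss of `model` on one text example."""
    return random.random()            # placeholder score

def train(model, batch):
    """Hypothetical: one round of MLM training on `batch`."""
    return model

def self_active_learning(model, pool, rounds: int = 5, k: int = 1000):
    selected = []
    for _ in range(rounds):
        # Score the remaining pool; highest-loss samples are "most beneficial".
        scored = sorted(pool, key=lambda ex: mlm_loss(model, ex), reverse=True)
        picked, pool = scored[:k], scored[k:]
        selected.extend(picked)
        model = train(model, selected)  # retrain on everything selected so far
    return model
```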