Automatic Correction of Writing Anomalies in Hausa Texts
Wali, Ahmad Mustapha; Nisioi, Sergiu
arXiv.org Artificial Intelligence
Hausa texts often contain writing anomalies, such as incorrect character substitutions and spacing errors, that can hinder natural language processing (NLP) applications. This paper presents an approach to correcting these anomalies automatically by fine-tuning transformer-based models. Using a corpus gathered from several public sources, we created a large-scale parallel dataset of over 450,000 noisy-clean Hausa sentence pairs by introducing synthetically generated noise designed to mimic realistic writing errors. We then adapted several multilingual and African language-focused models, including M2M100, AfriTEVA, mBART, and Opus-MT variants, to this correction task using SentencePiece tokenization. Our experimental results demonstrate significant increases in F1, BLEU, and METEOR scores, as well as reductions in Character Error Rate (CER) and Word Error Rate (WER). This research provides a robust methodology, a publicly available dataset, and effective models to improve Hausa text quality, thereby advancing NLP capabilities for the language and offering transferable insights for other low-resource languages.
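To make the noisy-clean pairing and the error-rate metrics concrete, here is a minimal sketch in Python. It is not the authors' pipeline: the substitution table (hooked Hausa letters typed as plain ASCII), the probabilities `p_sub` and `p_space`, and the function names `add_noise`, `cer`, and `wer` are all assumptions made for illustration. The abstract only states that the noise covers character substitutions and spacing errors.

```python
import random

# Assumed substitution table: hooked Hausa letters replaced by plain ASCII.
# The paper's actual noise rules are not given in the abstract.
HOOKED_TO_PLAIN = {
    "ɓ": "b", "ɗ": "d", "ƙ": "k", "ƴ": "y",
    "Ɓ": "B", "Ɗ": "D", "Ƙ": "K", "Ƴ": "Y",
}


def add_noise(sentence, rng, p_sub=0.5, p_space=0.1):
    """Return a noisy copy of a clean sentence (hypothetical noise model)."""
    out = []
    for ch in sentence:
        if ch in HOOKED_TO_PLAIN and rng.random() < p_sub:
            out.append(HOOKED_TO_PLAIN[ch])  # character substitution
        elif ch == " " and rng.random() < p_space:
            continue  # spacing error: drop the space, merging two words
        else:
            out.append(ch)
    return "".join(out)


def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(b) + 1))
    for i, item_a in enumerate(a, 1):
        cur = [i]
        for j, item_b in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,                        # deletion
                cur[j - 1] + 1,                     # insertion
                prev[j - 1] + (item_a != item_b),   # substitution or match
            ))
        prev = cur
    return prev[-1]


def cer(reference, hypothesis):
    """Character Error Rate: character edits divided by reference length."""
    return edit_distance(reference, hypothesis) / max(len(reference), 1)


def wer(reference, hypothesis):
    """Word Error Rate: word edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / max(len(ref_words), 1)


if __name__ == "__main__":
    rng = random.Random(0)
    clean = "ɗan ƙasa"
    noisy = add_noise(clean, rng)
    print(noisy, cer(clean, noisy), wer(clean, noisy))
```

Under these assumptions, a parallel dataset would be built by applying `add_noise` to each clean sentence and keeping the (noisy, clean) pair. A correction model's output would then be scored against the clean side with `cer` and `wer`, where lower is better.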
Jun-5-2025
- Country:
  - Africa
    - Benin (0.04)
    - Ghana (0.04)
    - Middle East (0.04)
    - Niger (0.04)
    - Nigeria > Kebbi State
      - Birnin Kebbi (0.05)
    - West Africa (0.04)
  - Asia
    - China (0.04)
    - Indonesia > Bali (0.04)
    - Middle East (0.04)
  - Europe
    - Belgium > Brussels-Capital Region
      - Brussels (0.04)
    - Middle East (0.04)
    - Moldova > Transnistria (0.05)
    - Portugal > Lisbon
      - Lisbon (0.04)
    - Romania > București - Ilfov Development Region
      - Municipality of Bucharest > Bucharest (0.04)
  - North America > United States (1.00)
- Genre:
  - Research Report > New Finding (0.48)
- Technology: