HausaMovieReview: A Benchmark Dataset for Sentiment Analysis in Low-Resource African Language

Zanga, Asiya Ibrahim, Abdulrahman, Salisu Mamman, Ado, Abubakar, Bichi, Abdulkadir Abubakar, Jibril, Lukman Aliyu, Umar, Abdulmajid Babangida, Adamu, Alhassan, Muhammad, Shamsuddeen Hassan, Abubakar, Bashir Salisu

arXiv.org Artificial Intelligence 

The development of Natural Language Processing (NLP) tools for low-resource languages is critically hindered by the scarcity of annotated datasets. This paper addresses this challenge by introducing HausaMovieReview, a novel benchmark dataset comprising 5,000 YouTube comments in Hausa and code-switched English. The dataset was annotated by three independent annotators, with strong inter-annotator agreement (Fleiss' Kappa of 0.85). We used this dataset to conduct a comparative analysis of classical models (Logistic Regression, Decision Tree, K-Nearest Neighbors) and fine-tuned transformer models (BERT and RoBERTa). Our results reveal a key finding: the Decision Tree classifier, with an accuracy of 89.72% and an F1-score of 89.60%, significantly outperformed the deep learning models. Our findings also provide a robust baseline, demonstrating that effective feature engineering can enable classical models to achieve state-of-the-art performance in low-resource contexts, thereby laying a solid foundation for future research.

Keywords: Hausa, Kannywood, Low-Resource Languages, NLP, Sentiment Analysis
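
For readers unfamiliar with the agreement statistic reported above, the following is a minimal sketch of how Fleiss' Kappa can be computed for a fixed number of annotators per item. It is not the authors' code: the function name `fleiss_kappa`, the label names (`pos`, `neg`, `neu`), and the toy ratings are assumptions made purely for illustration, since the abstract does not state the label set.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' Kappa for items rated by the same number of annotators.

    ratings: list of per-item label lists, e.g. [["pos", "pos", "neg"], ...]
    categories: every label that annotators could assign
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(r) != n_raters for r in ratings):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    # Per-item counts of how many annotators chose each category.
    counts = [Counter(item) for item in ratings]

    # Observed agreement: share of agreeing annotator pairs, averaged over items.
    p_items = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_bar = sum(p_items) / n_items

    # Chance agreement from the overall proportion of each category.
    p_cat = [sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories]
    p_e = sum(p * p for p in p_cat)

    if p_e == 1.0:
        # Every rating falls in one category; kappa is undefined, report 1.0 by convention.
        return 1.0
    return (p_bar - p_e) / (1.0 - p_e)


# Hypothetical toy data: four comments, three annotators each.
toy_ratings = [
    ["pos", "pos", "pos"],
    ["neg", "neg", "pos"],
    ["neu", "neu", "neu"],
    ["neg", "neg", "neg"],
]
print(round(fleiss_kappa(toy_ratings, ["pos", "neg", "neu"]), 4))
```

On real annotations, the same function would take one three-element label list per comment; a value near 0.85, as reported, indicates agreement well above what chance alone would produce.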