Omotayo, Abdul-Hakeem
Text Categorization Can Enhance Domain-Agnostic Stopword Extraction
Turki, Houcemeddine, Etori, Naome A., Taieb, Mohamed Ali Hadj, Omotayo, Abdul-Hakeem, Emezue, Chris Chinenye, Aouicha, Mohamed Ben, Awokoya, Ayodele, Lawan, Falalu Ibrahim, Nixdorf, Doreen
This paper investigates the role of text categorization in streamlining stopword extraction in natural language processing (NLP), focusing specifically on nine African languages alongside French. By leveraging the MasakhaNEWS, African Stopwords Project, and MasakhaPOS datasets, our findings show that text categorization effectively identifies domain-agnostic stopwords, with a detection success rate of over 80% for most of the examined languages. Nevertheless, linguistic variation results in lower detection rates for certain languages. Interestingly, we find that while over 40% of stopwords are common across news categories, less than 15% are unique to a single category. Uncommon stopwords add depth to text, but their classification as stopwords depends on context. Combining statistical and linguistic approaches therefore creates more comprehensive stopword lists, highlighting the value of our hybrid method. This research enhances NLP for African languages and underscores the importance of text categorization in stopword extraction.
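As a minimal sketch of the idea (not the paper's actual pipeline), the snippet below treats frequent terms that recur across every news category as domain-agnostic stopword candidates and terms frequent in only one category as category-specific candidates. The toy corpus, whitespace tokenization, and the TOP_K threshold are illustrative assumptions.

```python
# Sketch: candidate domain-agnostic stopwords = high-frequency terms shared
# across all news categories. Corpus and threshold are invented for illustration.
from collections import Counter, defaultdict

# toy corpus: {category: list of documents}
corpus = {
    "politics": ["the president said the vote is near", "a new law and the senate"],
    "sports":   ["the team won the match and the fans cheered", "a coach and a win"],
    "health":   ["the clinic and the doctor advise a vaccine", "a new drug and the trial"],
}

TOP_K = 5  # number of frequent terms considered per category (assumed value)

per_category_top = {}
for category, docs in corpus.items():
    counts = Counter(token for doc in docs for token in doc.split())
    per_category_top[category] = {w for w, _ in counts.most_common(TOP_K)}

# Domain-agnostic candidates: frequent in every category.
domain_agnostic = set.intersection(*per_category_top.values())

# Category-specific candidates: frequent in exactly one category.
membership = defaultdict(int)
for top_terms in per_category_top.values():
    for w in top_terms:
        membership[w] += 1
category_specific = {w for w, n in membership.items() if n == 1}

print("domain-agnostic stopword candidates:", domain_agnostic)
print("category-specific candidates:", category_specific)
```

In a hybrid setup of the kind described above, such statistical candidates would then be checked against linguistic criteria (e.g. part-of-speech information) before being accepted into a stopword list.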
AfriMTE and AfriCOMET: Empowering COMET to Embrace Under-resourced African Languages
Wang, Jiayi, Adelani, David Ifeoluwa, Agrawal, Sweta, Rei, Ricardo, Briakou, Eleftheria, Carpuat, Marine, Masiak, Marek, He, Xuanli, Bourhim, Sofia, Bukula, Andiswa, Mohamed, Muhidin, Olatoye, Temitayo, Mokayede, Hamam, Mwase, Christine, Kimotho, Wangui, Yuehgoh, Foutse, Aremu, Anuoluwapo, Ojo, Jessica, Muhammad, Shamsuddeen Hassan, Osei, Salomey, Omotayo, Abdul-Hakeem, Chukwuneke, Chiamaka, Ogayo, Perez, Hourrane, Oumaima, Anigri, Salma El, Ndolela, Lolwethu, Mangwana, Thabiso, Mohamed, Shafie Abdi, Hassan, Ayinde, Awoyomi, Oluwabusayo Olufunke, Alkhaled, Lama, Al-Azzawi, Sana, Etori, Naome A., Ochieng, Millicent, Siro, Clemencia, Njoroge, Samuel, Muchiri, Eric, Kimotho, Wangari, Momo, Lyse Naomi Wamba, Abolade, Daud, Ajao, Simbiat, Adewumi, Tosin, Shode, Iyanuoluwa, Macharm, Ricky, Iro, Ruqayya Nasir, Abdullahi, Saheed S., Moore, Stephen E., Opoku, Bernard, Akinjobi, Zainab, Afolabi, Abeeb, Obiefuna, Nnaemeka, Ogbu, Onyekachi Raphael, Brian, Sam, Otiende, Verrah Akinyi, Mbonu, Chinedu Emmanuel, Sari, Sakayo Toadoum, Stenetorp, Pontus
Despite the progress we have recorded in scaling multilingual machine translation (MT) models and evaluation data to several under-resourced African languages, it is difficult to accurately measure the progress we have made on these languages because evaluation is often performed with n-gram matching metrics such as BLEU, which often correlate poorly with human judgments. Embedding-based metrics such as COMET correlate better; however, the lack of evaluation data with human ratings for under-resourced languages, the complexity of annotation guidelines like Multidimensional Quality Metrics (MQM), and the limited language coverage of multilingual encoders have hampered their applicability to African languages. In this paper, we address these challenges by creating high-quality human evaluation data with a simplified MQM guideline for error-span annotation and direct assessment (DA) scoring for 13 typologically diverse African languages. Furthermore, we develop AfriCOMET, a COMET evaluation metric for African languages, by leveraging DA training data from high-resource languages and an African-centric multilingual encoder (AfroXLM-Roberta), yielding a state-of-the-art evaluation metric for African-language MT with respect to Spearman-rank correlation with human judgments (+0.406).
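The headline number above is a Spearman-rank correlation between metric scores and human judgments. A minimal sketch of that meta-evaluation step is shown below using scipy.stats.spearmanr; the segment scores and DA ratings are made-up placeholders, not values from the paper.

```python
# Sketch of metric meta-evaluation: Spearman-rank correlation between
# segment-level metric scores and human direct-assessment (DA) ratings.
from scipy.stats import spearmanr

metric_scores = [0.81, 0.42, 0.67, 0.90, 0.35]   # hypothetical metric outputs
human_da      = [78.0, 40.0, 55.0, 95.0, 30.0]   # hypothetical human DA scores

rho, p_value = spearmanr(metric_scores, human_da)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")
```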
MasakhaNEWS: News Topic Classification for African languages
Adelani, David Ifeoluwa, Masiak, Marek, Azime, Israel Abebe, Alabi, Jesujoba, Tonja, Atnafu Lambebo, Mwase, Christine, Ogundepo, Odunayo, Dossou, Bonaventure F. P., Oladipo, Akintunde, Nixdorf, Doreen, Emezue, Chris Chinenye, al-azzawi, sana, Sibanda, Blessing, David, Davis, Ndolela, Lolwethu, Mukiibi, Jonathan, Ajayi, Tunde, Moteu, Tatiana, Odhiambo, Brian, Owodunni, Abraham, Obiefuna, Nnaemeka, Mohamed, Muhidin, Muhammad, Shamsuddeen Hassan, Ababu, Teshome Mulugeta, Salahudeen, Saheed Abdullahi, Yigezu, Mesay Gemeda, Gwadabe, Tajuddeen, Abdulmumin, Idris, Taye, Mahlet, Awoyomi, Oluwabusayo, Shode, Iyanuoluwa, Adelani, Tolulope, Abdulganiyu, Habiba, Omotayo, Abdul-Hakeem, Adeeko, Adetola, Afolabi, Abeeb, Aremu, Anuoluwapo, Samuel, Olanrewaju, Siro, Clemencia, Kimotho, Wangari, Ogbu, Onyekachi, Mbonu, Chinedu, Chukwuneke, Chiamaka, Fanijo, Samuel, Ojo, Jessica, Awosan, Oyinkansola, Kebede, Tadesse, Sakayo, Toadoum Sari, Nyatsine, Pamela, Sidume, Freedmore, Yousuf, Oreen, Oduwole, Mardiyyah, Tshinu, Tshinu, Kimanuka, Ussen, Diko, Thina, Nxakama, Siyanda, Nigusse, Sinodos, Johar, Abdulmejid, Mohamed, Shafie, Hassan, Fuad Mire, Mehamed, Moges Ahmed, Ngabire, Evrard, Jules, Jules, Ssenkungu, Ivan, Stenetorp, Pontus
African languages are severely under-represented in NLP research due to a lack of datasets covering several NLP tasks. While there are individual language-specific datasets being expanded to different tasks, only a handful of NLP tasks (e.g. named entity recognition and machine translation) have standardized benchmark datasets covering several geographically and typologically diverse African languages. In this paper, we develop MasakhaNEWS -- a new benchmark dataset for news topic classification covering 16 languages widely spoken in Africa. We provide an evaluation of baseline models by training classical machine learning models and fine-tuning several language models. Furthermore, we explore several alternatives to full fine-tuning of language models that are better suited for zero-shot and few-shot learning, such as cross-lingual parameter-efficient fine-tuning (like MAD-X), pattern exploiting training (PET), prompting language models (like ChatGPT), and prompt-free sentence transformer fine-tuning (SetFit and Cohere Embedding API). Our evaluation in the zero-shot setting shows the potential of prompting ChatGPT for news topic classification in low-resource African languages, achieving an average performance of 70 F1 points without leveraging additional supervision like MAD-X. In the few-shot setting, we show that with as few as 10 examples per label, we achieve more than 90% (i.e. 86.0 F1 points) of the performance of fully supervised training (92.6 F1 points) leveraging the PET approach.
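To illustrate the kind of classical machine learning baseline mentioned above, the sketch below trains a TF-IDF plus logistic regression classifier with scikit-learn on an invented toy corpus; it is not the paper's actual baseline configuration and does not use the MasakhaNEWS data.

```python
# Illustrative classical baseline for news topic classification:
# TF-IDF features + logistic regression. Texts and labels are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = [
    "the national team won the cup final",
    "parliament debated the new budget bill",
    "doctors report a rise in malaria cases",
    "the striker scored twice in the derby",
]
train_labels = ["sports", "politics", "health", "sports"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(train_texts, train_labels)

print(clf.predict(["the president signed the bill into law"]))
```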
Adapting to the Low-Resource Double-Bind: Investigating Low-Compute Methods on Low-Resource African Languages
Leong, Colin, Shandilya, Herumb, Dossou, Bonaventure F. P., Tonja, Atnafu Lambebo, Mathew, Joel, Omotayo, Abdul-Hakeem, Yousuf, Oreen, Akinjobi, Zainab, Emezue, Chris Chinenye, Muhammad, Shamsudeen, Kolawole, Steven, Choi, Younwoo, Adewumi, Tosin
Many natural language processing (NLP) tasks make use of massively pre-trained language models, which are computationally expensive. However, limited access to high computational resources, compounded by the scarcity of data for African languages, constitutes a real barrier to research experiments on these languages. In this work, we explore the applicability of low-compute approaches such as language adapters in the context of this low-resource double-bind. We intend to answer the following question: do language adapters allow those who are doubly bound by data and compute to practically build useful models? Through fine-tuning experiments on African languages, we evaluate their effectiveness as cost-effective approaches to low-resource African NLP. Using solely free compute resources, our results show that language adapters achieve performance comparable to massive pre-trained language models that are heavy on computational resources. This opens the door to further experimentation and exploration of the full extent of language adapters' capacities.
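To convey why adapters are a low-compute option, the sketch below adds a small bottleneck adapter on top of a frozen stand-in encoder in PyTorch, so only a tiny fraction of parameters is trainable. The encoder, dimensions, and adapter size are assumptions for illustration and do not reproduce the paper's setup or any specific adapter library.

```python
# Conceptual sketch of a bottleneck language adapter: a small trainable module
# applied to the output of a frozen "pre-trained" encoder (here a stand-in).
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    def __init__(self, hidden_size: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        self.act = nn.ReLU()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the frozen representation intact.
        return hidden_states + self.up(self.act(self.down(hidden_states)))

hidden_size = 768
encoder = nn.TransformerEncoder(          # stand-in for a pre-trained encoder
    nn.TransformerEncoderLayer(d_model=hidden_size, nhead=12, batch_first=True),
    num_layers=2,
)
for p in encoder.parameters():            # freeze the "pre-trained" weights
    p.requires_grad = False

adapter = BottleneckAdapter(hidden_size)  # only these weights would be trained
x = torch.randn(4, 16, hidden_size)       # (batch, sequence, hidden)
out = adapter(encoder(x))

trainable = sum(p.numel() for p in adapter.parameters())
total = trainable + sum(p.numel() for p in encoder.parameters())
print(f"trainable params: {trainable} / {total} ({100 * trainable / total:.2f}%)")
```

Because only the adapter weights receive gradients, training of this kind fits within the free compute budgets discussed in the abstract.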