
WolBanking77: Wolof Banking Speech Intent Classification Dataset

Kandji, Abdou Karim, Precioso, Frédéric, Ba, Cheikh, Ndiaye, Samba, Ndione, Augustin

arXiv.org Artificial Intelligence

Intent classification models have made significant progress in recent years. However, previous studies primarily focus on high-resource language datasets, which leaves a gap for low-resource languages and for regions with high rates of illiteracy, where languages are more often spoken than read or written. This is the case in Senegal, for example, where Wolof is spoken by around 90% of the population while the national illiteracy rate remains at 42%. Wolof is spoken by more than 10 million people in the West African region. To address these limitations, we introduce the Wolof Banking Speech Intent Classification Dataset (WolBanking77) for academic research in intent classification. WolBanking77 currently contains 9,791 text sentences in the banking domain and more than 4 hours of spoken sentences. We conduct experiments on various baselines, including state-of-the-art text and voice models, and the results on the current dataset are very promising. In addition, this paper presents an in-depth examination of the dataset's contents. We report baseline F1-scores and word error rates for NLP and ASR models, respectively, trained on the WolBanking77 dataset, as well as comparisons between models. Dataset and code available at: https://github.com/abdoukarim/wolbanking77.
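As an illustration of the word error rate (WER) metric reported above, here is a minimal, generic sketch (not the paper's evaluation code) computing WER as word-level edit distance divided by reference length; the example sentences are hypothetical:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# One substitution ("begg" for "bëgg") out of three reference words.
print(wer("damay bëgg xaalis", "damay begg xaalis"))
```

Lower is better; a WER of 0.0 means the hypothesis matches the reference exactly.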


Sentiment Analysis on the young people's perception about the mobile Internet costs in Senegal

Mbaye, Derguene, Seye, Madoune Robert, Diallo, Moussa, Ndiaye, Mamadou Lamine, Sow, Djiby, Adjanohoun, Dimitri Samuel, Mbengue, Tatiana, Wade, Cheikh Samba, Pablo, De Roulet, Munyaka, Jean-Claude Baraka, Chenal, Jerome

arXiv.org Artificial Intelligence

Internet penetration rates in Africa are rising steadily, and mobile Internet is getting an even bigger boost from the availability of smartphones. Young people are increasingly using the Internet, especially social networks, and Senegal is no exception to this revolution. Social networks have become the main means of expression for young people. Despite this evolution in Internet access, there are few operators on the market, which limits the alternatives available in terms of value for money. In this paper, we look at how young people feel about the price of mobile Internet in Senegal, relative to the perceived quality of service, through their comments on social networks. We collected a set of Twitter and Facebook comments related to the subject and applied a sentiment analysis model to gauge their overall sentiment.


Task-Oriented Dialog Systems for the Senegalese Wolof Language

Mbaye, Derguene, Diallo, Moussa

arXiv.org Artificial Intelligence

Recent years have seen considerable interest in conversational agents with the rise of large language models (LLMs). Although they offer considerable advantages, LLMs also present significant risks, such as hallucination, which hinder their widespread deployment in industry. Moreover, low-resource languages such as African ones are still underrepresented in these systems, limiting their performance in these languages. In this paper, we illustrate a more classical approach based on the modular architecture of task-oriented dialog systems (ToDS), which offers better control over outputs. We propose a chatbot generation engine based on the Rasa framework and a robust methodology for projecting annotations onto the Wolof language using an in-house machine translation system. When evaluated on a generated chatbot trained on the Amazon Massive dataset, our Wolof intent classifier performs similarly to the one obtained for French, a resource-rich language. We also show that this approach extends to other low-resource languages, thanks to the intent classifier's language-agnostic pipeline, simplifying the design of chatbots in these languages.


Kallaama: A Transcribed Speech Dataset about Agriculture in the Three Most Widely Spoken Languages in Senegal

Gauthier, Elodie, Ndiaye, Aminata, Guissé, Abdoulaye

arXiv.org Artificial Intelligence

This work is part of the Kallaama project, whose objective is to produce and disseminate corpora of national languages for the development of speech technologies in the field of agriculture. Except for Wolof, which benefits from some language data for natural language processing, the national languages of Senegal are largely ignored by language technology providers. Yet such technologies are key to the protection, promotion, and teaching of these languages. Kallaama focuses on the three languages most widely spoken by Senegalese people: Wolof, Pulaar and Sereer. These languages are widely spoken by the population, with around 10 million native Senegalese speakers, not to mention those outside the country. However, they remain under-resourced in terms of machine-readable data that can be used for automatic processing and language technologies, all the more so in the agricultural sector. We release a transcribed speech dataset containing 125 hours of recordings about agriculture in the above-mentioned languages. These resources are specifically designed for automatic speech recognition purposes, including traditional approaches. To support the building of such technologies, we also provide textual corpora in Wolof and Pulaar, and a pronunciation lexicon containing 49,132 entries from the Wolof dataset.


Proof of Concept of a Voicebot Conversing in Wolof

Gauthier, Elodie, Wade, Papa-Séga, Moudenc, Thierry, Collen, Patrice, De Neef, Emilie, Ba, Oumar, Cama, Ndeye Khoyane, Kebe, Cheikh Ahmadou Bamba, Gningue, Ndeye Aissatou, Aristide, Thomas Mendo'o

arXiv.org Artificial Intelligence

This paper presents a proof of concept of the first automatic voice assistant ever built for Wolof, the main vehicular language spoken in Senegal. This voicebot is the result of a collaborative research project between Orange Innovation in France, Orange Senegal (aka Sonatel) and ADNCorp, a small IT company based in Dakar, Senegal. The purpose of the voicebot is to provide information to Orange customers about the Sargal loyalty program of Orange Senegal using the most natural means of communication: speech. The voicebot takes as input the customer's spoken request, which is then processed by a spoken language understanding (SLU) system, and replies to the customer using audio recordings. The first results of this proof of concept are encouraging: we achieved a WER of 22% on the ASR task and an F1-score of 78% on the NLU task.


MasakhaPOS: Part-of-Speech Tagging for Typologically Diverse African Languages

Dione, Cheikh M. Bamba, Adelani, David, Nabende, Peter, Alabi, Jesujoba, Sindane, Thapelo, Buzaaba, Happy, Muhammad, Shamsuddeen Hassan, Emezue, Chris Chinenye, Ogayo, Perez, Aremu, Anuoluwapo, Gitau, Catherine, Mbaye, Derguene, Mukiibi, Jonathan, Sibanda, Blessing, Dossou, Bonaventure F. P., Bukula, Andiswa, Mabuya, Rooweither, Tapo, Allahsera Auguste, Munkoh-Buabeng, Edwin, Koagne, victoire Memdjokam, Kabore, Fatoumata Ouoba, Taylor, Amelia, Kalipe, Godson, Macucwa, Tebogo, Marivate, Vukosi, Gwadabe, Tajuddeen, Elvis, Mboning Tchiaze, Onyenwe, Ikechukwu, Atindogbe, Gratien, Adelani, Tolulope, Akinade, Idris, Samuel, Olanrewaju, Nahimana, Marien, Musabeyezu, Théogène, Niyomutabazi, Emile, Chimhenga, Ester, Gotosa, Kudzai, Mizha, Patrick, Agbolo, Apelete, Traore, Seydou, Uchechukwu, Chinedu, Yusuf, Aliyu, Abdullahi, Muhammad, Klakow, Dietrich

arXiv.org Artificial Intelligence

In this paper, we present MasakhaPOS, the largest part-of-speech (POS) dataset for 20 typologically diverse African languages. We discuss the challenges of annotating POS for these languages using the universal dependencies (UD) guidelines. We conducted extensive POS baseline experiments using conditional random fields and several multilingual pre-trained language models, and applied various cross-lingual transfer models trained with data available in UD. Evaluating on the MasakhaPOS dataset, we show that choosing the best transfer language(s), in both single-source and multi-source setups, greatly improves POS tagging performance for the target languages, in particular when combined with cross-lingual parameter-efficient fine-tuning methods. Crucially, transferring knowledge from a language that matches the target's language family and morphosyntactic properties appears more effective for POS tagging in unseen languages.


Beqi: Revitalize the Senegalese Wolof Language with a Robust Spelling Corrector

Mbaye, Derguene, Diallo, Moussa

arXiv.org Artificial Intelligence

The progress of Natural Language Processing (NLP), although fast in recent years, has not been at the same pace for all languages. African languages in particular are still behind and lack automatic processing tools. Some of these tools are very important for the development of these languages and also play an important role in many NLP applications. This is particularly the case for automatic spell checkers. Several approaches to this task have been studied, and the one that models spelling correction as a translation task from misspelled (noisy) text to well-spelled (correct) text shows promising results. However, this approach requires a parallel corpus of noisy data on one side and correct data on the other, whereas Wolof is a low-resource language and has no such corpus. In this paper, we address the lack of data by generating synthetic data, and we present sequence-to-sequence deep learning models for spelling correction in Wolof. We evaluated these models in three different scenarios depending on the subwording method applied to the data and showed that the subwording method has a significant impact on model performance, which opens the way for future research on Wolof spelling correction.
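As a rough illustration of the synthetic-data idea described above, the following sketch (our own hypothetical example, not the paper's actual noise model) corrupts well-spelled sentences with random character-level edits to produce (noisy, correct) pairs for training a correction model:

```python
import random

def add_noise(sentence: str, p: float = 0.1, seed: int = 0) -> str:
    """Corrupt a well-spelled sentence with random character-level edits
    (deletion, swap, substitution) to build a synthetic parallel corpus."""
    rng = random.Random(seed)
    chars = list(sentence)
    out = []
    i = 0
    while i < len(chars):
        c = chars[i]
        if c != " " and rng.random() < p:
            op = rng.choice(["delete", "swap", "substitute"])
            if op == "delete":
                i += 1           # drop this character
                continue
            if op == "swap" and i + 1 < len(chars) and chars[i + 1] != " ":
                out.extend([chars[i + 1], c])  # transpose adjacent characters
                i += 2
                continue
            # substitute with a random letter (including Wolof diacritics)
            out.append(rng.choice("abcdefghijklmnopqrstuvwxyzëéàñŋ"))
            i += 1
            continue
        out.append(c)
        i += 1
    return "".join(out)

clean = "ndax mën nga ma jàppale"   # hypothetical well-spelled sentence
noisy = add_noise(clean, p=0.15, seed=42)
print((noisy, clean))               # one synthetic (source, target) pair
```

A real pipeline would draw noise from observed error distributions (e.g. dropped diacritics), but this shows how a parallel corpus can be bootstrapped from monolingual correct text.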


Low-Resourced Machine Translation for Senegalese Wolof Language

Mbaye, Derguene, Diallo, Moussa, Diop, Thierno Ibrahima

arXiv.org Artificial Intelligence

Natural Language Processing (NLP) research has made great advances in recent years, with major breakthroughs that have established new benchmarks. However, these advances have mainly benefited a group of languages commonly referred to as resource-rich, such as English and French. The majority of other, less-resourced languages are left behind, which is the case for most African languages, including Wolof. In this work, we present a parallel Wolof/French corpus of 123,000 sentences on which we conducted experiments with machine translation models based on recurrent neural networks (RNNs) in different data configurations. We observed performance gains with models trained on subworded data, as well as with those trained on the French-English language pair compared to those trained on the French-Wolof pair under the same experimental conditions.


Investigating data partitioning strategies for crosslinguistic low-resource ASR evaluation

Liu, Zoey, Spence, Justin, Prud'hommeaux, Emily

arXiv.org Artificial Intelligence

Many automatic speech recognition (ASR) datasets include a single pre-defined test set consisting of one or more speakers whose speech never appears in the training set. This "hold-speaker(s)-out" data partitioning strategy, however, may not be ideal for datasets in which the number of speakers is very small. This study investigates ten different data split methods for five languages with minimal ASR training resources. We find that (1) model performance varies greatly depending on which speaker is selected for testing; (2) the average word error rate (WER) across all held-out speakers is comparable not only to the average WER over multiple random splits but also to any given individual random split; (3) WER is also generally comparable when the data is split heuristically or adversarially; (4) utterance duration and intensity are comparatively more predictive of variability regardless of the data split. These results suggest that the widely used hold-speakers-out approach to ASR data partitioning can yield results that do not reflect model performance on unseen data or speakers. Random splits can yield more reliable and generalizable estimates when facing data sparsity.
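The two partitioning strategies compared above can be sketched as follows; this is an illustrative toy implementation under our own assumptions, not the study's code:

```python
import random

def hold_speaker_out(utterances, test_speaker):
    """Hold-speaker(s)-out split: every utterance of one speaker goes to test,
    so the test speaker is never seen during training."""
    train = [u for u in utterances if u["speaker"] != test_speaker]
    test = [u for u in utterances if u["speaker"] == test_speaker]
    return train, test

def random_split(utterances, test_frac=0.2, seed=0):
    """Random utterance-level split, ignoring speaker identity entirely."""
    rng = random.Random(seed)
    shuffled = utterances[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_frac))
    return shuffled[:cut], shuffled[cut:]

# Tiny hypothetical corpus: five utterances from three speakers.
corpus = [{"speaker": s, "text": f"utt{i}"}
          for i, s in enumerate(["A", "A", "B", "B", "C"])]

train, test = hold_speaker_out(corpus, "C")
print(len(train), len(test))  # 4 1
```

The study's point can be reproduced at small scale by evaluating a model once per held-out speaker and comparing the spread of WERs to that over several seeds of `random_split`.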


MasakhaNER: Named Entity Recognition for African Languages

Adelani, David Ifeoluwa, Abbott, Jade, Neubig, Graham, D'souza, Daniel, Kreutzer, Julia, Lignos, Constantine, Palen-Michel, Chester, Buzaaba, Happy, Rijhwani, Shruti, Ruder, Sebastian, Mayhew, Stephen, Azime, Israel Abebe, Muhammad, Shamsuddeen, Emezue, Chris Chinenye, Nakatumba-Nabende, Joyce, Ogayo, Perez, Aremu, Anuoluwapo, Gitau, Catherine, Mbaye, Derguene, Alabi, Jesujoba, Yimam, Seid Muhie, Gwadabe, Tajuddeen, Ezeani, Ignatius, Niyongabo, Rubungo Andre, Mukiibi, Jonathan, Otiende, Verrah, Orife, Iroro, David, Davis, Ngom, Samba, Adewumi, Tosin, Rayson, Paul, Adeyemi, Mofetoluwa, Muriuki, Gerald, Anebi, Emmanuel, Chukwuneke, Chiamaka, Odu, Nkiruka, Wairagala, Eric Peter, Oyerinde, Samuel, Siro, Clemencia, Bateesa, Tobius Saul, Oloyede, Temilola, Wambui, Yvonne, Akinode, Victor, Nabagereka, Deborah, Katusiime, Maurice, Awokoya, Ayodele, MBOUP, Mouhamadane, Gebreyohannes, Dibora, Tilaye, Henok, Nwaike, Kelechi, Wolde, Degaga, Faye, Abdoulaye, Sibanda, Blessing, Ahia, Orevaoghene, Dossou, Bonaventure F. P., Ogueji, Kelechi, DIOP, Thierno Ibrahima, Diallo, Abdoulaye, Akinfaderin, Adewale, Marengereke, Tendai, Osei, Salomey

arXiv.org Artificial Intelligence

We take a step towards addressing the under-representation of the African continent in NLP research by creating the first large publicly available high-quality dataset for named entity recognition (NER) in ten African languages, bringing together a variety of stakeholders. We detail characteristics of the languages to help researchers understand the challenges that these languages pose for NER. We analyze our datasets and conduct an extensive empirical evaluation of state-of-the-art methods across both supervised and transfer learning settings. We release the data, code, and models in order to inspire future research on African NLP.