Kano
Quantity vs. Quality of Monolingual Source Data in Automatic Text Translation: Can It Be Too Little If It Is Too Good?
Abdulmumin, Idris, Galadanci, Bashir Shehu, Aliyu, Garba, Muhammad, Shamsuddeen Hassan
Monolingual data, being readily available in large quantities, has been used to upscale the scarcely available parallel data to train better models for automatic translation. Self-learning, where a model is made to learn from its output, is one approach to exploit such data. However, it has been shown that too much of this data can be detrimental to the performance of the model if the available parallel data is comparatively extremely low. In this study, we investigate whether the monolingual data can also be too little and if this reduction, based on quality, has any effect on the performance of the translation model. Experiments have shown that on English-German low-resource NMT, it is often better to select only the most useful additional data, based on quality or closeness to the domain of the test data, than utilizing all of the available data.
Active Sensing of Knee Osteoarthritis Progression with Reinforcement Learning
Nguyen, Khanh, Nguyen, Huy Hoang, Panfilov, Egor, Tiulpin, Aleksei
Osteoarthritis (OA) is the most common musculoskeletal disease, which has no cure. Knee OA (KOA) is one of the highest causes of disability worldwide, and it costs billions of United States dollars to the global community. Prediction of KOA progression has been of high interest to the community for years, as it can advance treatment development through more efficient clinical trials and improve patient outcomes through more efficient healthcare utilization. Existing approaches for predicting KOA, however, are predominantly static, i.e. consider data from a single time point to predict progression many years into the future, and knee level, i.e. consider progression in a single joint only. Due to these and related reasons, these methods fail to deliver the level of predictive performance, which is sufficient to result in cost savings and better patient outcomes. Collecting extensive data from all patients on a regular basis could address the issue, but it is limited by the high cost at a population level. In this work, we propose to go beyond static prediction models in OA, and bring a novel Active Sensing (AS) approach, designed to dynamically follow up patients with the objective of maximizing the number of informative data acquisitions, while minimizing their total cost over a period of time. Our approach is based on Reinforcement Learning (RL), and it leverages a novel reward function designed specifically for AS of disease progression in more than one part of a human body. Our method is end-to-end, relies on multi-modal Deep Learning, and requires no human input at inference time. Throughout an exhaustive experimental evaluation, we show that using RL can provide a higher monetary benefit when compared to state-of-the-art baselines.
Artificial Intelligence for Public Health Surveillance in Africa: Applications and Opportunities
Tshimula, Jean Marie, Kalengayi, Mitterrand, Makenga, Dieumerci, Lilonge, Dorcas, Asumani, Marius, Madiya, Déborah, Kalonji, Élie Nkuba, Kanda, Hugues, Galekwa, René Manassé, Kumbu, Josias, Mikese, Hardy, Tshimula, Grace, Muabila, Jean Tshibangu, Mayemba, Christian N., Nkashama, D'Jeff K., Kalala, Kalonji, Ataky, Steve, Basele, Tighana Wenge, Didier, Mbuyi Mukendi, Kasereka, Selain K., Dialufuma, Maximilien V., Kumwita, Godwill Ilunga Wa, Muyuku, Lionel, Kimpesa, Jean-Paul, Muteba, Dominique, Abedi, Aaron Aruna, Ntobo, Lambert Mukendi, Bundutidi, Gloria M., Mashinda, Désiré Kulimba, Mpinga, Emmanuel Kabengele, Kasoro, Nathanaël M.
Artificial Intelligence (AI) is revolutionizing various fields, including public health surveillance. In Africa, where health systems frequently encounter challenges such as limited resources, inadequate infrastructure, failed health information systems and a shortage of skilled health professionals, AI offers a transformative opportunity. This paper investigates the applications of AI in public health surveillance across the continent, presenting successful case studies and examining the benefits, opportunities, and challenges of implementing AI technologies in African healthcare settings. Our paper highlights AI's potential to enhance disease monitoring and health outcomes, and support effective public health interventions. The findings presented in the paper demonstrate that AI can significantly improve the accuracy and timeliness of disease detection and prediction, optimize resource allocation, and facilitate targeted public health strategies. Additionally, our paper identified key barriers to the widespread adoption of AI in African public health systems and proposed actionable recommendations to overcome these challenges.
DustNet: skillful neural network predictions of Saharan dust
Nowak, Trish E., Augousti, Andy T., Simmons, Benno I., Siegert, Stefan
Suspended in the atmosphere are millions of tonnes of mineral dust which interacts with weather and climate. Accurate representation of mineral dust in weather models is vital, yet remains challenging. Large scale weather models use high power supercomputers and take hours to complete the forecast. Such computational burden allows them to only include monthly climatological means of mineral dust as input states inhibiting their forecasting accuracy. Here, we introduce DustNet a simple, accurate and super fast forecasting model for 24-hours ahead predictions of aerosol optical depth AOD. DustNet trains in less than 8 minutes and creates predictions in 2 seconds on a desktop computer. Created by DustNet predictions outperform the state-of-the-art physics-based model on coarse 1 x 1 degree resolution at 95% of grid locations when compared to ground truth satellite data. Our results show DustNet has a potential for fast and accurate AOD forecasting which could transform our understanding of dust impacts on weather patterns.
Analyzing COVID-19 Vaccination Sentiments in Nigerian Cyberspace: Insights from a Manually Annotated Twitter Dataset
Ahmad, Ibrahim Said, Aliyu, Lukman Jibril, Khalid, Abubakar Auwal, Aliyu, Saminu Muhammad, Muhammad, Shamsuddeen Hassan, Abdulmumin, Idris, Abduljalil, Bala Mairiga, Bello, Bello Shehu, Abubakar, Amina Imam
Numerous successes have been achieved in combating the COVID-19 pandemic, initially using various precautionary measures like lockdowns, social distancing, and the use of face masks. More recently, various vaccinations have been developed to aid in the prevention or reduction of the severity of the COVID-19 infection. Despite the effectiveness of the precautionary measures and the vaccines, there are several controversies that are massively shared on social media platforms like Twitter. In this paper, we explore the use of state-of-the-art transformer-based language models to study people's acceptance of vaccines in Nigeria. We developed a novel dataset by crawling multi-lingual tweets using relevant hashtags and keywords. Our analysis and visualizations revealed that most tweets expressed neutral sentiments about COVID-19 vaccines, with some individuals expressing positive views, and there was no strong preference for specific vaccine types, although Moderna received slightly more positive sentiment. We also found out that fine-tuning a pre-trained LLM with an appropriate dataset can yield competitive results, even if the LLM was not initially pre-trained on the specific language of that dataset.
Leveraging Closed-Access Multilingual Embedding for Automatic Sentence Alignment in Low Resource Languages
Abdulmumin, Idris, Khalid, Auwal Abubakar, Muhammad, Shamsuddeen Hassan, Ahmad, Ibrahim Said, Aliyu, Lukman Jibril, Sani, Babangida, Abduljalil, Bala Mairiga, Hassan, Sani Ahmad
The importance of qualitative parallel data in machine translation has long been determined but it has always been very difficult to obtain such in sufficient quantity for the majority of world languages, mainly because of the associated cost and also the lack of accessibility to these languages. Despite the potential for obtaining parallel datasets from online articles using automatic approaches, forensic investigations have found a lot of quality-related issues such as misalignment, and wrong language codes. In this work, we present a simple but qualitative parallel sentence aligner that carefully leveraged the closed-access Cohere multilingual embedding, a solution that ranked second in the just concluded #CoHereAIHack 2023 Challenge (see https://ai6lagos.devpost.com). The proposed approach achieved $94.96$ and $54.83$ f1 scores on FLORES and MAFAND-MT, compared to $3.64$ and $0.64$ of LASER respectively. Our method also achieved an improvement of more than 5 BLEU scores over LASER, when the resulting datasets were used with MAFAND-MT dataset to train translation models. Our code and data are available for research purposes here (https://github.com/abumafrim/Cohere-Align).
HaVQA: A Dataset for Visual Question Answering and Multimodal Research in Hausa Language
Parida, Shantipriya, Abdulmumin, Idris, Muhammad, Shamsuddeen Hassan, Bose, Aneesh, Kohli, Guneet Singh, Ahmad, Ibrahim Said, Kotwal, Ketan, Sarkar, Sayan Deb, Bojar, Ondřej, Kakudi, Habeebah Adamu
This paper presents HaVQA, the first multimodal dataset for visual question-answering (VQA) tasks in the Hausa language. The dataset was created by manually translating 6,022 English question-answer pairs, which are associated with 1,555 unique images from the Visual Genome dataset. As a result, the dataset provides 12,044 gold standard English-Hausa parallel sentences that were translated in a fashion that guarantees their semantic match with the corresponding visual information. We conducted several baseline experiments on the dataset, including visual question answering, visual question elicitation, text-only and multimodal machine translation.
MasakhaPOS: Part-of-Speech Tagging for Typologically Diverse African Languages
Dione, Cheikh M. Bamba, Adelani, David, Nabende, Peter, Alabi, Jesujoba, Sindane, Thapelo, Buzaaba, Happy, Muhammad, Shamsuddeen Hassan, Emezue, Chris Chinenye, Ogayo, Perez, Aremu, Anuoluwapo, Gitau, Catherine, Mbaye, Derguene, Mukiibi, Jonathan, Sibanda, Blessing, Dossou, Bonaventure F. P., Bukula, Andiswa, Mabuya, Rooweither, Tapo, Allahsera Auguste, Munkoh-Buabeng, Edwin, Koagne, victoire Memdjokam, Kabore, Fatoumata Ouoba, Taylor, Amelia, Kalipe, Godson, Macucwa, Tebogo, Marivate, Vukosi, Gwadabe, Tajuddeen, Elvis, Mboning Tchiaze, Onyenwe, Ikechukwu, Atindogbe, Gratien, Adelani, Tolulope, Akinade, Idris, Samuel, Olanrewaju, Nahimana, Marien, Musabeyezu, Théogène, Niyomutabazi, Emile, Chimhenga, Ester, Gotosa, Kudzai, Mizha, Patrick, Agbolo, Apelete, Traore, Seydou, Uchechukwu, Chinedu, Yusuf, Aliyu, Abdullahi, Muhammad, Klakow, Dietrich
In this paper, we present MasakhaPOS, the largest part-of-speech (POS) dataset for 20 typologically diverse African languages. We discuss the challenges in annotating POS for these languages using the UD (universal dependencies) guidelines. We conducted extensive POS baseline experiments using conditional random field and several multilingual pre-trained language models. We applied various cross-lingual transfer models trained with data available in UD. Evaluating on the MasakhaPOS dataset, we show that choosing the best transfer language(s) in both single-source and multi-source setups greatly improves the POS tagging performance of the target languages, in particular when combined with cross-lingual parameter-efficient fine-tuning methods. Crucially, transferring knowledge from a language that matches the language family and morphosyntactic properties seems more effective for POS tagging in unseen languages.
HERDPhobia: A Dataset for Hate Speech against Fulani in Nigeria
Aliyu, Saminu Mohammad, Wajiga, Gregory Maksha, Murtala, Muhammad, Muhammad, Shamsuddeen Hassan, Abdulmumin, Idris, Ahmad, Ibrahim Said
Social media platforms allow users to freely share their opinions about issues or anything they feel like. However, they also make it easier to spread hate and abusive content. The Fulani ethnic group has been the victim of this unfortunate phenomenon. This paper introduces the HERDPhobia - the first annotated hate speech dataset on Fulani herders in Nigeria - in three languages: English, Nigerian-Pidgin, and Hausa. We present a benchmark experiment using pre-trained languages models to classify the tweets as either hateful or non-hateful. Our experiment shows that the XML-T model provides better performance with 99.83% weighted F1. We released the dataset at https://github.com/hausanlp/HERDPhobia for further research.
Deep Sequence Models for Text Classification Tasks
Abdullahi, Saheed Salahudeen, Yiming, Sun, Muhammad, Shamsuddeen Hassan, Mustapha, Abdulrasheed, Aminu, Ahmad Muhammad, Abdullahi, Abdulkadir, Bello, Musa, Aliyu, Saminu Mohammad
The exponential growth of data generated on the Internet in the current information age is a driving force for the digital economy. Extraction of information is the major value in an accumulated big data. Big data dependency on statistical analysis and hand-engineered rules machine learning algorithms are overwhelmed with vast complexities inherent in human languages. Natural Language Processing (NLP) is equipping machines to understand these human diverse and complicated languages. Text Classification is an NLP task which automatically identifies patterns based on predefined or undefined labeled sets. Common text classification application includes information retrieval, modeling news topic, theme extraction, sentiment analysis, and spam detection. In texts, some sequences of words depend on the previous or next word sequences to make full meaning; this is a challenging dependency task that requires the machine to be able to store some previous important information to impact future meaning. Sequence models such as RNN, GRU, and LSTM is a breakthrough for tasks with long-range dependencies. As such, we applied these models to Binary and Multi-class classification. Results generated were excellent with most of the models performing within the range of 80% and 94%. However, this result is not exhaustive as we believe there is room for improvement if machines are to compete with humans.