Adelani, David Ifeoluwa
Mitigating Translationese in Low-resource Languages: The Storyboard Approach
Kuwanto, Garry, Urua, Eno-Abasi E., Amuok, Priscilla Amondi, Muhammad, Shamsuddeen Hassan, Aremu, Anuoluwapo, Otiende, Verrah, Nanyanga, Loice Emma, Nyoike, Teresiah W., Akpan, Aniefon D., Udouboh, Nsima Ab, Archibong, Idongesit Udeme, Moses, Idara Effiong, Ige, Ifeoluwatayo A., Ajibade, Benjamin, Awokoya, Olumide Benjamin, Abdulmumin, Idris, Aliyu, Saminu Mohammad, Iro, Ruqayya Nasir, Ahmad, Ibrahim Said, Smith, Deontae, Michaels, Praise-EL, Adelani, David Ifeoluwa, Wijaya, Derry Tanti, Andy, Anietie
Low-resource languages often face challenges in acquiring high-quality language data due to the reliance on translation-based methods, which can introduce the translationese effect. This phenomenon results in translated sentences that lack fluency and naturalness in the target language. In this paper, we propose a novel approach for data collection by leveraging storyboards to elicit more fluent and natural sentences. Our method involves presenting native speakers with visual stimuli in the form of storyboards and collecting their descriptions without direct exposure to the source text. We conducted a comprehensive evaluation comparing our storyboard-based approach with traditional text translation-based methods in terms of accuracy and fluency. Human annotators and quantitative metrics were used to assess translation quality. The results indicate a preference for text translation in terms of accuracy, while our method demonstrates worse accuracy but better fluency in the language focused.
Voices Unheard: NLP Resources and Models for Yor\`ub\'a Regional Dialects
Ahia, Orevaoghene, Aremu, Anuoluwapo, Abagyan, Diana, Gonen, Hila, Adelani, David Ifeoluwa, Abolade, Daud, Smith, Noah A., Tsvetkov, Yulia
Yor\`ub\'a an African language with roughly 47 million speakers encompasses a continuum with several dialects. Recent efforts to develop NLP technologies for African languages have focused on their standard dialects, resulting in disparities for dialects and varieties for which there are little to no resources or tools. We take steps towards bridging this gap by introducing a new high-quality parallel text and speech corpus YOR\`ULECT across three domains and four regional Yor\`ub\'a dialects. To develop this corpus, we engaged native speakers, travelling to communities where these dialects are spoken, to collect text and speech data. Using our newly created corpus, we conducted extensive experiments on (text) machine translation, automatic speech recognition, and speech-to-text translation. Our results reveal substantial performance disparities between standard Yor\`ub\'a and the other dialects across all tasks. However, we also show that with dialect-adaptive finetuning, we are able to narrow this gap. We believe our dataset and experimental analysis will contribute greatly to developing NLP tools for Yor\`ub\'a and its dialects, and potentially for other African languages, by improving our understanding of existing challenges and offering a high-quality dataset for further development. We release YOR\`ULECT dataset and models publicly under an open license.
CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark
Romero, David, Lyu, Chenyang, Wibowo, Haryo Akbarianto, Lynn, Teresa, Hamed, Injy, Kishore, Aditya Nanda, Mandal, Aishik, Dragonetti, Alina, Abzaliev, Artem, Tonja, Atnafu Lambebo, Balcha, Bontu Fufa, Whitehouse, Chenxi, Salamea, Christian, Velasco, Dan John, Adelani, David Ifeoluwa, Meur, David Le, Villa-Cueva, Emilio, Koto, Fajri, Farooqui, Fauzan, Belcavello, Frederico, Batnasan, Ganzorig, Vallejo, Gisela, Caulfield, Grainne, Ivetta, Guido, Song, Haiyue, Ademtew, Henok Biadglign, Maina, Hernรกn, Lovenia, Holy, Azime, Israel Abebe, Cruz, Jan Christian Blaise, Gala, Jay, Geng, Jiahui, Ortiz-Barajas, Jesus-German, Baek, Jinheon, Dunstan, Jocelyn, Alemany, Laura Alonso, Nagasinghe, Kumaranage Ravindu Yasas, Benotti, Luciana, D'Haro, Luis Fernando, Viridiano, Marcelo, Estecha-Garitagoitia, Marcos, Cabrera, Maria Camila Buitrago, Rodrรญguez-Cantelar, Mario, Jouitteau, Mรฉlanie, Mihaylov, Mihail, Imam, Mohamed Fazli Mohamed, Adilazuarda, Muhammad Farid, Gochoo, Munkhjargal, Otgonbold, Munkh-Erdene, Etori, Naome, Niyomugisha, Olivier, Silva, Paula Mรณnica, Chitale, Pranjal, Dabre, Raj, Chevi, Rendi, Zhang, Ruochen, Diandaru, Ryandito, Cahyawijaya, Samuel, Gรณngora, Santiago, Jeong, Soyeong, Purkayastha, Sukannya, Kuribayashi, Tatsuki, Jayakumar, Thanmay, Torrent, Tiago Timponi, Ehsan, Toqeer, Araujo, Vladimir, Kementchedjhieva, Yova, Burzo, Zara, Lim, Zheng Wei, Yong, Zheng Xin, Ignat, Oana, Nwatu, Joan, Mihalcea, Rada, Solorio, Thamar, Aji, Alham Fikri
Visual Question Answering (VQA) is an important task in multimodal AI, and it is often used to test the ability of vision-language models to understand and reason on knowledge present in both visual and textual data. However, most of the current VQA models use datasets that are primarily focused on English and a few major world languages, with images that are typically Western-centric. While recent efforts have tried to increase the number of languages covered on VQA datasets, they still lack diversity in low-resource languages. More importantly, although these datasets often extend their linguistic range via translation or some other approaches, they usually keep images the same, resulting in narrow cultural representation. To address these limitations, we construct CVQA, a new Culturally-diverse multilingual Visual Question Answering benchmark, designed to cover a rich set of languages and cultures, where we engage native speakers and cultural experts in the data collection process. As a result, CVQA includes culturally-driven images and questions from across 28 countries on four continents, covering 26 languages with 11 scripts, providing a total of 9k questions. We then benchmark several Multimodal Large Language Models (MLLMs) on CVQA, and show that the dataset is challenging for the current state-of-the-art models. This benchmark can serve as a probing evaluation suite for assessing the cultural capability and bias of multimodal models and hopefully encourage more research efforts toward increasing cultural awareness and linguistic diversity in this field.
IrokoBench: A New Benchmark for African Languages in the Age of Large Language Models
Adelani, David Ifeoluwa, Ojo, Jessica, Azime, Israel Abebe, Zhuang, Jian Yun, Alabi, Jesujoba O., He, Xuanli, Ochieng, Millicent, Hooker, Sara, Bukula, Andiswa, Lee, En-Shiun Annie, Chukwuneke, Chiamaka, Buzaaba, Happy, Sibanda, Blessing, Kalipe, Godson, Mukiibi, Jonathan, Kabongo, Salomon, Yuehgoh, Foutse, Setaka, Mmasibidi, Ndolela, Lolwethu, Odu, Nkiruka, Mabuya, Rooweither, Muhammad, Shamsuddeen Hassan, Osei, Salomey, Samb, Sokhar, Guge, Tadesse Kebede, Stenetorp, Pontus
Despite the widespread adoption of Large language models (LLMs), their remarkable capabilities remain limited to a few high-resource languages. Additionally, many low-resource languages (e.g., African languages) are often evaluated only on basic text classification tasks due to the lack of appropriate or comprehensive benchmarks outside of high-resource languages. In this paper, we introduce IrokoBench--a human-translated benchmark dataset for 16 typologicallydiverse low-resource African languages covering three tasks: natural language inference (AfriXNLI), mathematical reasoning (AfriMGSM), and multi-choice knowledge-based QA (AfriMMLU). We use IrokoBench to evaluate zero-shot, few-shot, and translate-test settings (where test sets are translated into English) across 10 open and four proprietary LLMs. Our evaluation reveals a significant performance gap between high-resource languages (such as English and French) and low-resource African languages. We observe a significant performance gap between open and proprietary models, with the highest performing open model, Aya-101 only at 58% of the best-performing proprietary model GPT-4o performance. Machine translating the test set to English before evaluation helped to close the gap for larger models that are English-centric, like LLaMa 3 70B. These findings suggest that more efforts are needed to develop and adapt LLMs for African languages.
Comparing LLM prompting with Cross-lingual transfer performance on Indigenous and Low-resource Brazilian Languages
Adelani, David Ifeoluwa, Doฤruรถz, A. Seza, Coneglian, Andrรฉ, Ojha, Atul Kr.
Large Language Models are transforming NLP for a variety of tasks. However, how LLMs perform NLP tasks for low-resource languages (LRLs) is less explored. In line with the goals of the AmericasNLP workshop, we focus on 12 LRLs from Brazil, 2 LRLs from Africa and 2 high-resource languages (HRLs) (e.g., English and Brazilian Portuguese). Our results indicate that the LLMs perform worse for the part of speech (POS) labeling of LRLs in comparison to HRLs. We explain the reasons behind this failure and provide an error analysis through examples observed in our data set.
How good are Large Language Models on African Languages?
Ojo, Jessica, Ogueji, Kelechi, Stenetorp, Pontus, Adelani, David Ifeoluwa
Recent advancements in natural language processing have led to the proliferation of large language models (LLMs). These models have been shown to yield good performance, using in-context learning, even on tasks and languages they are not trained on. However, their performance on African languages is largely understudied relative to high-resource languages. We present an analysis of four popular large language models (mT0, Aya, LLaMa 2, and GPT-4) on six tasks (topic classification, sentiment classification, machine translation, summarization, question answering, and named entity recognition) across 60 African languages, spanning different language families and geographical regions. Our results suggest that all LLMs produce lower performance for African languages, and there is a large gap in performance compared to high-resource languages (such as English) for most tasks. We find that GPT-4 has an average to good performance on classification tasks, yet its performance on generative tasks such as machine translation and summarization is significantly lacking. Surprisingly, we find that mT0 had the best overall performance for cross-lingual QA, better than the state-of-the-art supervised model (i.e. fine-tuned mT5) and GPT-4 on African languages. Similarly, we find the recent Aya model to have comparable result to mT0 in almost all tasks except for topic classification where it outperform mT0. Overall, LLaMa 2 showed the worst performance, which we believe is due to its English and code-centric~(around 98%) pre-training corpus. Our findings confirms that performance on African languages continues to remain a hurdle for the current LLMs, underscoring the need for additional efforts to close this gap.
Which Nigerian-Pidgin does Generative AI speak?: Issues about Representativeness and Bias for Multilingual and Low Resource Languages
Adelani, David Ifeoluwa, Doฤruรถz, A. Seza, Shode, Iyanuoluwa, Aremu, Anuoluwapo
Naija is the Nigerian-Pidgin spoken by approx. 120M speakers in Nigeria and it is a mixed language (e.g., English, Portuguese and Indigenous languages). Although it has mainly been a spoken language until recently, there are currently two written genres (BBC and Wikipedia) in Naija. Through statistical analyses and Machine Translation experiments, we prove that these two genres do not represent each other (i.e., there are linguistic differences in word order and vocabulary) and Generative AI operates only based on Naija written in the BBC genre. In other words, Naija written in Wikipedia genre is not represented in Generative AI.
EkoHate: Abusive Language and Hate Speech Detection for Code-switched Political Discussions on Nigerian Twitter
Ilevbare, Comfort Eseohen, Alabi, Jesujoba O., Adelani, David Ifeoluwa, Bakare, Firdous Damilola, Abiola, Oluwatoyin Bunmi, Adeyemo, Oluwaseyi Adesina
Nigerians have a notable online presence and actively discuss political and topical matters. This was particularly evident throughout the 2023 general election, where Twitter was used for campaigning, fact-checking and verification, and even positive and negative discourse. However, little or none has been done in the detection of abusive language and hate speech in Nigeria. In this paper, we curated code-switched Twitter data directed at three musketeers of the governorship election on the most populous and economically vibrant state in Nigeria; Lagos state, with the view to detect offensive speech in political discussions. We developed EkoHate -- an abusive language and hate speech dataset for political discussions between the three candidates and their followers using a binary (normal vs offensive) and fine-grained four-label annotation scheme. We analysed our dataset and provided an empirical evaluation of state-of-the-art methods across both supervised and cross-lingual transfer learning settings. In the supervised setting, our evaluation results in both binary and four-label annotation schemes show that we can achieve 95.1 and 70.3 F1 points respectively. Furthermore, we show that our dataset adequately transfers very well to three publicly available offensive datasets (OLID, HateUS2020, and FountaHate), generalizing to political discussions in other regions like the US.
ANGOFA: Leveraging OFA Embedding Initialization and Synthetic Data for Angolan Language Model
Quinjica, Osvaldo Luamba, Adelani, David Ifeoluwa
In recent years, the development of pre-trained language models (PLMs) has gained momentum, showcasing their capacity to transcend linguistic barriers and facilitate knowledge transfer across diverse languages. However, this progress has predominantly bypassed the inclusion of very-low resource languages, creating a notable void in the multilingual landscape. This paper addresses this gap by introducing four tailored PLMs specifically finetuned for Angolan languages, employing a Multilingual Adaptive Fine-tuning (MAFT) approach. In this paper, we survey the role of informed embedding initialization and synthetic data in enhancing the performance of MAFT models in downstream tasks. We improve baseline over SOTA AfroXLMR-base (developed through MAFT) and OFA (an effective embedding initialization) by 12.3 and 3.8 points respectively.
SIB-200: A Simple, Inclusive, and Big Evaluation Dataset for Topic Classification in 200+ Languages and Dialects
Adelani, David Ifeoluwa, Liu, Hannah, Shen, Xiaoyu, Vassilyev, Nikita, Alabi, Jesujoba O., Mao, Yanke, Gao, Haonan, Lee, Annie En-Shiun
Despite the progress we have recorded in the last few years in multilingual natural language processing, evaluation is typically limited to a small set of languages with available datasets which excludes a large number of low-resource languages. In this paper, we created SIB-200 -- a large-scale open-sourced benchmark dataset for topic classification in 200 languages and dialects to address the lack of evaluation dataset for Natural Language Understanding (NLU). For many of the languages covered in SIB-200, this is the first publicly available evaluation dataset for NLU. The dataset is based on Flores-200 machine translation corpus. We annotated the English portion of the dataset and extended the sentence-level annotation to the remaining 203 languages covered in the corpus. Despite the simplicity of this task, our evaluation in full-supervised setting, cross-lingual transfer setting and prompting of large language model setting show that there is still a large gap between the performance of high-resource and low-resource languages when multilingual evaluation is scaled to numerous world languages. We found that languages unseen during the pre-training of multilingual language models, under-represented language families (like Nilotic and Altantic-Congo), and languages from the regions of Africa, Americas, Oceania and South East Asia, often have the lowest performance on our topic classification dataset. We hope our dataset will encourage a more inclusive evaluation of multilingual language models on a more diverse set of languages. https://github.com/dadelani/sib-200