Collaborating Authors

Adelani, David I.


YAD: Leveraging T5 for Improved Automatic Diacritization of Yorùbá Text

arXiv.org Artificial Intelligence

In addition, we pre-train a text-to-text transformer (T5) model for Yorùbá and show that it outperforms several multilingually trained T5 models. Lastly, we show that more data and larger models yield better diacritization for Yorùbá.

Yorùbá, a language spoken predominantly in West Africa, is renowned for its tonal nature, which is characterized by a heavy use of diacritics to signify tone variations. In Yorùbá and many other languages, diacritics play a crucial role in disambiguating word meanings and in word pronunciation, making accurate diacritization essential for effective communication and language processing tasks (Skiredj & Berrada, 2024). However, manual diacritization is time-consuming and requires specialized linguistic expertise, motivating the development of automatic diacritization systems. In recent years, significant progress has been made in natural language processing (NLP) techniques, leading to the exploration of various approaches to automating the diacritization process for languages that use diacritics (Náplava et al., 2018; Mubarak et al., 2019; Náplava et al., 2021; Stankevicius et al., 2022, inter alia), including Yorùbá (Orife, 2018; Orife et al., 2020).
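As a rough illustration of the text-to-text framing the paper describes, the sketch below treats diacritization as sequence-to-sequence generation with a T5-style model. The checkpoint name is a generic multilingual stand-in (an assumption, not the authors' Yorùbá-pretrained model), and a base checkpoint will only produce useful output after fine-tuning on (undiacritized, diacritized) sentence pairs.

```python
# Minimal sketch: diacritization as a text-to-text (seq2seq) task.
# "google/mt5-small" is a stand-in checkpoint (assumption), not the
# Yorùbá-pretrained T5 built in the paper; it needs fine-tuning on
# (undiacritized, diacritized) pairs before it diacritizes usefully.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL_NAME = "google/mt5-small"  # stand-in checkpoint (assumption)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

def diacritize(text: str) -> str:
    """Generate the diacritized form of an undiacritized input sentence."""
    inputs = tokenizer(text, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=128)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# After fine-tuning, a call like this should restore tonal marks,
# e.g. "bi o tile je pe" -> "bí ó tilẹ̀ jẹ́ pé" (illustrative pair).
print(diacritize("bi o tile je pe"))
```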


Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation

arXiv.org Artificial Intelligence

Cultural biases in multilingual datasets pose significant challenges for their effectiveness as global benchmarks. These biases stem not only from language but also from the cultural knowledge required to interpret questions, reducing the practical utility of translated datasets like MMLU. Furthermore, translation often introduces artifacts that can distort the meaning or clarity of questions in the target language. A common practice in multilingual evaluation is to rely on machine-translated evaluation sets, but simply translating a dataset is insufficient to address these challenges. In this work, we trace the impact of both of these issues on multilingual evaluations and ensuing model performances. Our large-scale evaluation of state-of-the-art open and proprietary models illustrates that progress on MMLU depends heavily on learning Western-centric concepts, with 28% of all questions requiring culturally sensitive knowledge. Moreover, for questions requiring geographic knowledge, an astounding 84.9% focus on either North American or European regions. Rankings of model evaluations change depending on whether they are evaluated on the full portion or the subset of questions annotated as culturally sensitive, showing the distortion to model rankings when blindly relying on translated MMLU. We release Global-MMLU, an improved MMLU with evaluation coverage across 42 languages -- with improved overall quality by engaging with compensated professional and community annotators to verify translation quality while also rigorously evaluating cultural biases present in the original dataset. This comprehensive Global-MMLU set also includes designated subsets labeled as culturally sensitive and culturally agnostic to allow for more holistic, complete evaluation.
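The paper's central recommendation -- reporting culturally-sensitive and culturally-agnostic scores separately rather than pooling them -- can be sketched as below. The Hub repo id, config name, and column names are assumptions about the public release and may need adjusting to the actual schema.

```python
# Sketch: score a model separately on the culturally-sensitive (CS) and
# culturally-agnostic (CA) subsets of Global-MMLU. The repo id, the "yo"
# language config, and column names ("option_a" ... "cultural_sensitivity_label")
# are assumptions about the released dataset, not verified identifiers.
from datasets import load_dataset

ds = load_dataset("CohereForAI/Global-MMLU", "yo", split="test")  # repo id assumed

def subset_accuracy(examples, predict) -> float:
    """predict(question, options) returns 'A'/'B'/'C'/'D'; compare to gold."""
    hits = 0
    for ex in examples:
        options = [ex["option_a"], ex["option_b"], ex["option_c"], ex["option_d"]]
        hits += predict(ex["question"], options) == ex["answer"]
    return hits / len(examples)

cs = ds.filter(lambda ex: ex["cultural_sensitivity_label"] == "CS")
ca = ds.filter(lambda ex: ex["cultural_sensitivity_label"] == "CA")
# Report both numbers: the paper shows model rankings can flip between them.
```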


XTREME-UP: A User-Centric Scarce-Data Benchmark for Under-Represented Languages

arXiv.org Artificial Intelligence

Data scarcity is a crucial issue for the development of highly multilingual NLP systems. Yet for many under-represented languages (ULs) -- languages for which NLP research is particularly far behind in meeting user needs -- it is feasible to annotate small amounts of data. Motivated by this, we propose XTREME-UP, a benchmark defined by: its focus on the scarce-data scenario rather than zero-shot; its focus on user-centric tasks -- tasks with broad adoption by speakers of high-resource languages; and its focus on under-represented languages where this scarce-data scenario tends to be most realistic. XTREME-UP evaluates the capabilities of language models across 88 under-represented languages over 9 key user-centric technologies, including ASR, OCR, MT, and information access tasks that are of general utility. We create new datasets for OCR, autocomplete, semantic parsing, and transliteration, and build on and refine existing datasets for other tasks. XTREME-UP provides a methodology for evaluating many modeling scenarios, including text-only, multi-modal (vision, audio, and text), supervised parameter tuning, and in-context learning. We evaluate commonly used models on the benchmark. We release all code and scripts to train and evaluate models.
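Several of the task families named here (ASR, OCR, transliteration) are conventionally scored with character error rate; a generic implementation is sketched below. This is illustrative only, not the benchmark's official scoring script.

```python
# Generic character error rate (CER): character-level Levenshtein distance
# normalized by reference length. Illustrative sketch, not XTREME-UP's
# official scorer.
def cer(reference: str, hypothesis: str) -> float:
    """Edit distance over characters, divided by the reference length."""
    m, n = len(reference), len(hypothesis)
    dp = list(range(n + 1))  # dp[j] = distance(reference[:i], hypothesis[:j])
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(
                dp[j] + 1,      # reference char dropped in hypothesis
                dp[j - 1] + 1,  # extra char inserted in hypothesis
                prev + (reference[i - 1] != hypothesis[j - 1]),  # substitution
            )
            prev = cur
    return dp[n] / max(m, 1)

print(cer("Ibàdàn", "Ibadan"))  # 2 substitutions over 6 chars ≈ 0.33
```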


MphayaNER: Named Entity Recognition for Tshivenda

arXiv.org Artificial Intelligence

Named Entity Recognition (NER) plays a vital role in various Natural Language Processing tasks such as information retrieval, text classification, and question answering. However, NER can be challenging, especially in low-resource languages with limited annotated datasets and tools. This paper adds to the effort of addressing these challenges by introducing MphayaNER, the first Tshivenda NER corpus in the news domain. We establish NER baselines by fine-tuning state-of-the-art models on MphayaNER. The study also explores zero-shot transfer between Tshivenda and other related Bantu languages, with chiShona and Kiswahili showing the best results. Augmenting MphayaNER with chiShona data was also found to improve model performance significantly. Both MphayaNER and the baseline models are made publicly available.
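NER fine-tuning of this kind is usually framed as token classification over BIO tags; the sketch below shows the scaffolding. The encoder checkpoint is one plausible choice for Bantu languages, and the label set and example tokens are illustrative assumptions rather than MphayaNER's actual schema.

```python
# Sketch of the token-classification setup behind NER fine-tuning.
# The label set and example words are illustrative assumptions, not
# MphayaNER's actual schema; the checkpoint is one plausible encoder.
from transformers import AutoModelForTokenClassification, AutoTokenizer

LABELS = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]
MODEL_NAME = "Davlan/afro-xlmr-base"  # Africa-centric XLM-R variant (assumption)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(
    MODEL_NAME,
    num_labels=len(LABELS),
    id2label=dict(enumerate(LABELS)),
    label2id={label: i for i, label in enumerate(LABELS)},
)

# NER data arrives as word-level BIO tags; since the tokenizer splits words
# into subwords, word tags must be aligned to subtokens before training.
words = ["Thohoyandou", "Limpopo"]  # placeholder token sequence
encoding = tokenizer(words, is_split_into_words=True, return_tensors="pt")
logits = model(**encoding).logits  # shape: (1, seq_len, len(LABELS))
print(logits.shape)
```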