Y-NQ: English-Yorùbá Evaluation dataset for Open-Book Reading Comprehension and Text Generation

Costa-jussà, Marta R., Chen, Joy, Adebara, Ifeoluwanimi, Chuang, Joe, Ropers, Christophe, Sánchez, Eduardo

arXiv.org Artificial Intelligence 

This study explores the intersection of reading comprehension and text generation, examining how models perform on tasks that require both in-context understanding (i.e., an open-book setting, where the model has access to the context document during inference to answer a particular question) and generative text production (i.e., the answer is free text that has to be compared to a gold-standard reference). We investigate performance on this task in two languages: a high-resource language (English) and a low-resource language (Yorùbá). To this end, we introduce Y-NQ (Yorùbá Natural Questions), a comprehensive open-book question-answer dataset (Section 2). Y-NQ is sourced from NQ (Kwiatkowski et al., 2019) and provides complete article context for informed answers and text generation tasks, as well as parallel documents on the same topic for both high- and low-resource languages. The dataset also records how comparable the responses are across languages. As a result, we increase the Natural Language Processing (NLP) resources available in Yorùbá (Ahia et al., 2024). We benchmark state-of-the-art Large Language Models (LLMs) on our dataset. The results and analysis (Section 3) show that responses in Yorùbá are less accurate than those in English. As a by-product of human annotation, we identify inaccuracies in the English-language version of some Wikipedia articles (26 incorrect answers out of 1,566 human-analyzed questions in the English-language subset of articles), which confirms the existence of accuracy discrepancies across languages for the same Wikipedia topics and supports, for example, the need to better interlink Wikipedia articles across languages (Klang and Nugues, 2016).