Collaborating Authors

 Ech-Chammakhy, Yasir


Palm: A Culturally Inclusive and Linguistically Diverse Dataset for Arabic LLMs

arXiv.org Artificial Intelligence

As large language models (LLMs) become increasingly integrated into daily life, ensuring their cultural sensitivity and inclusivity is paramount. We introduce Palm, a dataset built through a year-long community-driven project covering all 22 Arab countries. The dataset includes instruction (input, response) pairs in both Modern Standard Arabic (MSA) and dialectal Arabic (DA), spanning 20 diverse topics. Built by a team of 44 researchers across the Arab world, all of whom are authors of this paper, our dataset offers a broad, inclusive perspective. We use our dataset to evaluate the cultural and dialectal capabilities of several frontier LLMs, revealing notable limitations. For instance, while closed-source LLMs generally exhibit strong performance, they are not without flaws, and smaller open-source models face greater challenges. Moreover, certain countries (e.g., Egypt, the UAE) appear better represented than others (e.g., Iraq, Mauritania, Yemen). Our annotation guidelines, code, and data for reproducibility are publicly available.


Casablanca: Data and Models for Multidialectal Arabic Speech Recognition

arXiv.org Artificial Intelligence

Arabic encompasses a diverse array of linguistic varieties, many of which are nearly mutually unintelligible (Watson, 2007; Abdul-Mageed et al., 2024). This diversity includes three primary categories: Classical Arabic, historically used in literature and still employed in religious contexts; Modern Standard Arabic (MSA), used in media, education, and governmental settings; and numerous colloquial dialects, which are the main forms of daily communication across the Arab world and often involve code-switching (Abdul-Mageed et al., 2020; Mubarak et al., 2021).

[...] for a select few languages. This bias towards resource-rich languages leaves behind the majority of the world's languages (Bartelds et al., 2023; Talafha et al., 2023; Meelen et al., 2024; Tonja et al., 2024). In this work, we report our efforts to alleviate this challenge for Arabic, a collection of languages and dialects spoken by more than 450 million people. We detail a year-long community effort to collect and annotate a novel dataset for eight Arabic dialects spanning both Africa and Asia. This new dataset, dubbed Casablanca, is rich [...]