Shatnawi, Sara
Palm: A Culturally Inclusive and Linguistically Diverse Dataset for Arabic LLMs
Alwajih, Fakhraddin, Mekki, Abdellah El, Magdy, Samar Mohamed, Elmadany, Abdelrahim A., Nacar, Omer, Nagoudi, El Moatez Billah, Abdel-Salam, Reem, Atwany, Hanin, Nafea, Youssef, Yahya, Abdulfattah Mohammed, Alhamouri, Rahaf, Alsayadi, Hamzah A., Zayed, Hiba, Shatnawi, Sara, Sibaee, Serry, Ech-Chammakhy, Yasir, Al-Dhabyani, Walid, Ali, Marwa Mohamed, Jarraya, Imen, El-Shangiti, Ahmed Oumar, Alraeesi, Aisha, Al-Ghrawi, Mohammed Anwar, Al-Batati, Abdulrahman S., Mohamed, Elgizouli, Elgindi, Noha Taha, Saeed, Muhammed, Atou, Houdaifa, Yahia, Issam Ait, Bouayad, Abdelhak, Machrouh, Mohammed, Makouar, Amal, Alkawi, Dania, Mohamed, Mukhtar, Abdelfadil, Safaa Taher, Ounnoughene, Amine Ziad, Anfel, Rouabhia, Assi, Rwaa, Sorkatti, Ahmed, Tourad, Mohamedou Cheikh, Koubaa, Anis, Berrada, Ismail, Jarrar, Mustafa, Shehata, Shady, Abdul-Mageed, Muhammad
As large language models (LLMs) become increasingly integrated into daily life, ensuring their cultural sensitivity and inclusivity is paramount. We introduce Palm, a dataset built through a year-long, community-driven project covering all 22 Arab countries. The dataset includes instructions (input-response pairs) in both Modern Standard Arabic (MSA) and dialectal Arabic (DA), spanning 20 diverse topics. Built by a team of 44 researchers across the Arab world, all of whom are authors of this paper, the dataset offers a broad, inclusive perspective. We use Palm to evaluate the cultural and dialectal capabilities of several frontier LLMs, revealing notable limitations: while closed-source LLMs generally perform well, they are not without flaws, and smaller open-source models face greater challenges. Moreover, certain countries (e.g., Egypt, the UAE) appear better represented than others (e.g., Iraq, Mauritania, Yemen). Our annotation guidelines, code, and data are publicly available for reproducibility.
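As a rough illustration of the kind of records such a dataset contains, the sketch below models a single instruction pair tagged with country, topic, and language variety, and tallies coverage per country. The field names are assumptions for illustration, not the released schema.

```python
# Minimal sketch of a culturally tagged instruction pair; field names
# (country, topic, variety, ...) are illustrative assumptions.
from collections import Counter
from dataclasses import dataclass

@dataclass
class InstructionPair:
    country: str      # one of the 22 Arab countries
    topic: str        # one of the 20 topics
    variety: str      # "MSA" or "DA" (dialectal Arabic)
    instruction: str  # input written by a native speaker
    response: str     # reference response

def coverage_by_country(pairs: list[InstructionPair]) -> Counter:
    """Count pairs per country, e.g. to check how balanced coverage is."""
    return Counter(p.country for p in pairs)
```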
Commonsense Reasoning in Arab Culture
Sadallah, Abdelrahman, Tonga, Junior Cedric, Almubarak, Khalid, Almheiri, Saeed, Atif, Farah, Qwaider, Chatrine, Kadaoui, Karima, Shatnawi, Sara, Alesh, Yaser, Koto, Fajri
Despite progress in Arabic large language models, such as Jais and AceGPT, their evaluation on commonsense reasoning has largely relied on machine-translated datasets, which lack cultural depth and may introduce Anglocentric biases. Commonsense reasoning is shaped by geographical and cultural contexts, and existing English datasets fail to capture the diversity of the Arab world. To address this, we introduce a commonsense reasoning dataset in Modern Standard Arabic (MSA), covering the cultures of 13 countries across the Gulf, Levant, North Africa, and the Nile Valley. The dataset was built from scratch by engaging native speakers to write and validate culturally relevant questions for their respective countries. It spans 12 daily-life domains with 54 fine-grained subtopics, reflecting various aspects of social norms, traditions, and everyday experiences. Zero-shot evaluations show that open-weight language models with up to 32B parameters struggle to comprehend diverse Arab cultures, with performance varying across regions. These findings highlight the need for more culturally aware models and datasets tailored to the Arabic-speaking world.
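If the questions are cast as multiple choice, a zero-shot evaluation of the kind reported above is commonly run by scoring each answer option under the model and picking the most likely one. Below is a minimal sketch of that recipe with a Hugging Face causal LM; the record fields ("question", "choices") and the scoring details are assumptions for illustration, not the paper's exact setup.

```python
# Minimal sketch: zero-shot multiple-choice scoring by option log-likelihood.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "..."  # any open-weight causal LM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def option_logprob(question: str, option: str) -> float:
    """Sum of log-probabilities of the option tokens given the question."""
    prompt_ids = tok(question + "\n", return_tensors="pt").input_ids
    full_ids = tok(question + "\n" + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits                    # [1, seq, vocab]
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)  # position i predicts token i+1
    # Note: tokenization at the prompt/option boundary can merge tokens;
    # the simple length difference below is an approximation.
    option_len = full_ids.shape[1] - prompt_ids.shape[1]
    targets = full_ids[:, -option_len:]
    return log_probs[:, -option_len:].gather(-1, targets.unsqueeze(-1)).sum().item()

def predict(record: dict) -> int:
    """Return the index of the highest-scoring choice."""
    return max(range(len(record["choices"])),
               key=lambda i: option_logprob(record["question"], record["choices"][i]))
```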
Casablanca: Data and Models for Multidialectal Arabic Speech Recognition
Talafha, Bashar, Kadaoui, Karima, Magdy, Samar Mohamed, Habiboullah, Mariem, Chafei, Chafei Mohamed, El-Shangiti, Ahmed Oumar, Zayed, Hiba, Tourad, Mohamedou Cheikh, Alhamouri, Rahaf, Assi, Rwaa, Alraeesi, Aisha, Mohamed, Hour, Alwajih, Fakhraddin, Mohamed, Abdelrahman, Mekki, Abdellah El, Nagoudi, El Moatez Billah, Saadia, Benelhadj Djelloul Mama, Alsayadi, Hamzah A., Al-Dhabyani, Walid, Shatnawi, Sara, Ech-Chammakhy, Yasir, Makouar, Amal, Berrachedi, Yousra, Jarrar, Mustafa, Shehata, Shady, Berrada, Ismail, Abdul-Mageed, Muhammad
Arabic encompasses a diverse array of linguistic varieties, many of which are nearly mutually unintelligible (Watson, 2007; Abdul-Mageed et al., 2024). This diversity includes three primary categories: Classical Arabic, historically used in literature and still employed in religious contexts; Modern Standard Arabic (MSA), used in media, education, and governmental settings; and numerous colloquial dialects, which are the main forms of daily communication across the Arab world and often involve code-switching (Abdul-Mageed et al., 2020; Mubarak et al., 2021).

[...] for a select few languages. This bias towards resource-rich languages leaves behind the majority of the world's languages (Bartelds et al., 2023; Talafha et al., 2023; Meelen et al., 2024; Tonja et al., 2024). In this work, we report our efforts to alleviate this challenge for Arabic--a collection of languages and dialects spoken by more than 450 million people. We detail a year-long community effort to collect and annotate a novel dataset for eight Arabic dialects spanning both Africa and Asia. This new dataset, dubbed Casablanca, is rich [...]
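As a small illustration of how such a corpus can be used, the sketch below transcribes dialect-tagged recordings with an off-the-shelf Whisper model and reports word error rate per dialect. The field names ("audio_path", "transcript", "dialect") are assumptions for illustration, and the paper's own baselines may differ.

```python
# Minimal sketch: per-dialect WER for an off-the-shelf ASR baseline.
from collections import defaultdict

import jiwer    # pip install jiwer
import whisper  # pip install openai-whisper

model = whisper.load_model("small")

def wer_per_dialect(examples):
    refs, hyps = defaultdict(list), defaultdict(list)
    for ex in examples:
        result = model.transcribe(ex["audio_path"], language="ar")
        refs[ex["dialect"]].append(ex["transcript"])
        hyps[ex["dialect"]].append(result["text"])
    return {dialect: jiwer.wer(refs[dialect], hyps[dialect]) for dialect in refs}
```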
ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic
Koto, Fajri, Li, Haonan, Shatnawi, Sara, Doughman, Jad, Sadallah, Abdelrahman Boda, Alraeesi, Aisha, Almubarak, Khalid, Alyafeai, Zaid, Sengupta, Neha, Shehata, Shady, Habash, Nizar, Nakov, Preslav, Baldwin, Timothy
The focus of language model evaluation has transitioned towards reasoning and knowledge-intensive tasks, driven by advancements in pretraining large models. While state-of-the-art models are partially trained on large Arabic texts, evaluating their performance in Arabic remains challenging due to the limited availability of relevant datasets. To bridge this gap, we present ArabicMMLU, the first multi-task language understanding benchmark for the Arabic language, sourced from school exams across diverse educational levels in different countries spanning North Africa, the Levant, and the Gulf regions. Our data comprises 40 tasks and 14,575 multiple-choice questions in Modern Standard Arabic (MSA), and is carefully constructed by collaborating with native speakers in the region. Our comprehensive evaluations of 35 models reveal substantial room for improvement, particularly among the best open-source models. Notably, BLOOMZ, mT0, LLaMA2, and Falcon struggle to achieve a score of 50%, while even the top-performing Arabic-centric model only achieves a score of 62.3%.
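Results on a benchmark of this shape are typically reported as per-task accuracy together with a macro average over tasks; a short sketch of that aggregation is below, with hypothetical record fields ("task", "gold", "pred").

```python
# Minimal sketch: per-task accuracy and a macro average over tasks.
from collections import defaultdict

def per_task_accuracy(records):
    correct, total = defaultdict(int), defaultdict(int)
    for r in records:
        total[r["task"]] += 1
        correct[r["task"]] += int(r["pred"] == r["gold"])
    acc = {task: correct[task] / total[task] for task in total}
    macro = sum(acc.values()) / len(acc)
    return acc, macro
```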
Automatic Restoration of Diacritics for Speech Data Sets
Shatnawi, Sara, Alqahtani, Sawsan, Aldarmaki, Hanan
Automatic text-based diacritic restoration models generally have high diacritic error rates when applied to speech transcripts as a result of domain and style shifts in spoken language. In this work, we explore the possibility of improving the performance of automatic diacritic restoration when applied to speech data by utilizing the parallel spoken utterances. In particular, we use the pre-trained Whisper ASR model fine-tuned on relatively small amounts of diacritized Arabic speech data to produce rough diacritized transcripts for the speech utterances, which we then use as an additional input for a transformer-based diacritic restoration model. The proposed model consistently improves diacritic restoration performance compared to an equivalent text-only model, with at least 5% absolute reduction in diacritic error rate within the same domain and on two out-of-domain test sets. Our results underscore the inadequacy of current text-based diacritic restoration models for speech data sets and provide a new baseline for speech-based diacritic restoration.
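A simplified sketch of this idea is below (not the authors' exact architecture): the rough diacritized ASR transcript is encoded alongside the raw text and fused via cross-attention before a per-character diacritic classifier. Layer sizes, the fusion mechanism, and the omission of positional encodings are all illustrative simplifications.

```python
# Simplified sketch of a speech-aware diacritic restoration model in PyTorch.
import torch
import torch.nn as nn

class SpeechAwareDiacritizer(nn.Module):
    def __init__(self, vocab_size=80, num_diacritics=15, d_model=256, nhead=4, layers=2):
        super().__init__()
        self.text_emb = nn.Embedding(vocab_size, d_model)
        self.asr_emb = nn.Embedding(vocab_size, d_model)
        self.text_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), layers)
        self.asr_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), layers)
        self.cross_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.classifier = nn.Linear(d_model, num_diacritics)

    def forward(self, text_ids, asr_ids):
        text_h = self.text_enc(self.text_emb(text_ids))   # encode raw characters
        asr_h = self.asr_enc(self.asr_emb(asr_ids))       # encode rough ASR diacritization
        fused, _ = self.cross_attn(text_h, asr_h, asr_h)  # text attends to ASR hints
        return self.classifier(fused)                     # one diacritic label per character

# Toy usage: batch of 2 sequences, 50 characters each.
model = SpeechAwareDiacritizer()
logits = model(torch.randint(0, 80, (2, 50)), torch.randint(0, 80, (2, 50)))
print(logits.shape)  # torch.Size([2, 50, 15])
```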