El-Shangiti, Ahmed Oumar
Palm: A Culturally Inclusive and Linguistically Diverse Dataset for Arabic LLMs
Alwajih, Fakhraddin, Mekki, Abdellah El, Magdy, Samar Mohamed, Elmadany, Abdelrahim A., Nacar, Omer, Nagoudi, El Moatez Billah, Abdel-Salam, Reem, Atwany, Hanin, Nafea, Youssef, Yahya, Abdulfattah Mohammed, Alhamouri, Rahaf, Alsayadi, Hamzah A., Zayed, Hiba, Shatnawi, Sara, Sibaee, Serry, Ech-Chammakhy, Yasir, Al-Dhabyani, Walid, Ali, Marwa Mohamed, Jarraya, Imen, El-Shangiti, Ahmed Oumar, Alraeesi, Aisha, Al-Ghrawi, Mohammed Anwar, Al-Batati, Abdulrahman S., Mohamed, Elgizouli, Elgindi, Noha Taha, Saeed, Muhammed, Atou, Houdaifa, Yahia, Issam Ait, Bouayad, Abdelhak, Machrouh, Mohammed, Makouar, Amal, Alkawi, Dania, Mohamed, Mukhtar, Abdelfadil, Safaa Taher, Ounnoughene, Amine Ziad, Anfel, Rouabhia, Assi, Rwaa, Sorkatti, Ahmed, Tourad, Mohamedou Cheikh, Koubaa, Anis, Berrada, Ismail, Jarrar, Mustafa, Shehata, Shady, Abdul-Mageed, Muhammad
As large language models (LLMs) become increasingly integrated into daily life, ensuring their cultural sensitivity and inclusivity is paramount. We introduce Palm, a dataset built through a year-long community-driven project covering all 22 Arab countries. The dataset includes instructions (input and response pairs) in both Modern Standard Arabic (MSA) and dialectal Arabic (DA), spanning 20 diverse topics. Built by a team of 44 researchers across the Arab world, all of whom are authors of this paper, our dataset offers a broad, inclusive perspective. We use our dataset to evaluate the cultural and dialectal capabilities of several frontier LLMs, revealing notable limitations: while closed-source LLMs generally perform well, they are not without flaws, and smaller open-source models face greater challenges. Moreover, certain countries (e.g., Egypt, the UAE) appear better represented than others (e.g., Iraq, Mauritania, Yemen). Our annotation guidelines, code, and data are publicly available for reproducibility.
Number Representations in LLMs: A Computational Parallel to Human Perception
AlquBoj, H. V., AlQuabeh, Hilal, Bojkovic, Velibor, Hiraoka, Tatsuya, El-Shangiti, Ahmed Oumar, Nwadike, Munachiso, Inui, Kentaro
Humans are believed to perceive numbers on a logarithmic mental number line, where smaller values are represented with greater resolution than larger ones. This cognitive bias, supported by neuroscience and behavioral studies, suggests that numerical magnitudes are processed in a sublinear fashion rather than on a uniform linear scale. Inspired by this hypothesis, we investigate whether large language models (LLMs) exhibit a similar logarithmic-like structure in their internal numerical representations. By analyzing how numerical values are encoded across different layers of LLMs, we apply dimensionality reduction techniques such as PCA and PLS followed by geometric regression to uncover latent structures in the learned embeddings. Our findings reveal that the model's numerical representations exhibit sublinear spacing, with distances between values aligning with a logarithmic scale. This suggests that LLMs, much like humans, may encode numbers in a compressed, non-uniform manner.
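The analysis described above can be illustrated with a minimal sketch. The data here is synthetic (embeddings constructed to vary with log(n) plus noise stand in for real LLM hidden states, which is an assumption for illustration only); the pipeline — reduce with PCA, then compare a linear fit against a logarithmic fit of position versus value — mirrors the kind of geometric regression the abstract describes:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in for LLM hidden states: we synthesize embeddings
# whose dominant direction varies with log(n). With a real model, `states`
# would be the layer activations for the number tokens 1..N.
rng = np.random.default_rng(0)
numbers = np.arange(1, 101)
direction = rng.normal(size=64)
states = np.log(numbers)[:, None] * direction + 0.05 * rng.normal(size=(100, 64))

# Project to one dimension with PCA, then ask which scale explains the
# layout better: a linear spacing in n, or a logarithmic one.
coord = PCA(n_components=1).fit_transform(states).ravel()
lin = LinearRegression().fit(numbers[:, None], coord)
log = LinearRegression().fit(np.log(numbers)[:, None], coord)
r2_lin = lin.score(numbers[:, None], coord)
r2_log = log.score(np.log(numbers)[:, None], coord)
print(f"R^2 linear={r2_lin:.3f}  log={r2_log:.3f}")
```

On this synthetic data the logarithmic fit wins by construction; the paper's claim is that the same comparison, run on real LLM number embeddings, also favors the logarithmic scale.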
The Geometry of Numerical Reasoning: Language Models Compare Numeric Properties in Linear Subspaces
El-Shangiti, Ahmed Oumar, Hiraoka, Tatsuya, AlQuabeh, Hilal, Heinzerling, Benjamin, Inui, Kentaro
We first identified, using partial least squares regression, linear subspaces that effectively encode the numerical attributes associated with the entities in comparison prompts. Further, we demonstrate causality by intervening in these subspaces to manipulate hidden states, thereby altering the LLM's comparison outcomes. Experimental results demonstrate that our findings hold for different numerical attributes, which indicates that LLMs utilize the linearly encoded information for numerical comparisons.
Figure 1: Summary of our approach. We extract contextualized numeric attribute activations, train a k-component PLS model on the activations to predict their values, and then use the first component of the PLS model to intervene at the last token of the second entity in the logical comparison.
Casablanca: Data and Models for Multidialectal Arabic Speech Recognition
Talafha, Bashar, Kadaoui, Karima, Magdy, Samar Mohamed, Habiboullah, Mariem, Chafei, Chafei Mohamed, El-Shangiti, Ahmed Oumar, Zayed, Hiba, Tourad, Mohamedou Cheikh, Alhamouri, Rahaf, Assi, Rwaa, Alraeesi, Aisha, Mohamed, Hour, Alwajih, Fakhraddin, Mohamed, Abdelrahman, Mekki, Abdellah El, Nagoudi, El Moatez Billah, Saadia, Benelhadj Djelloul Mama, Alsayadi, Hamzah A., Al-Dhabyani, Walid, Shatnawi, Sara, Ech-Chammakhy, Yasir, Makouar, Amal, Berrachedi, Yousra, Jarrar, Mustafa, Shehata, Shady, Berrada, Ismail, Abdul-Mageed, Muhammad
Arabic encompasses a diverse array of linguistic varieties, many of which are nearly mutually unintelligible (Watson, 2007; Abdul-Mageed et al., 2024). This diversity includes three primary categories: Classical Arabic, historically used in literature and still employed in religious contexts; Modern Standard Arabic (MSA), used in media, education, and governmental settings; and numerous colloquial dialects, which are the main forms of daily communication across the Arab world and often involve code-switching (Abdul-Mageed et al., 2020; Mubarak et al., 2021).
... for a select few languages. This bias towards resource-rich languages leaves behind the majority of the world's languages (Bartelds et al., 2023; Talafha et al., 2023; Meelen et al., 2024; Tonja et al., 2024). In this work, we report our efforts to alleviate this challenge for Arabic, a collection of languages and dialects spoken by more than 450 million people. We detail a year-long community effort to collect and annotate a novel dataset for eight Arabic dialects spanning both Africa and Asia. This new dataset, dubbed Casablanca, is rich ...
Arabic Automatic Story Generation with Large Language Models
El-Shangiti, Ahmed Oumar, Alwajih, Fakhraddin, Abdul-Mageed, Muhammad
Large language models (LLMs) have recently emerged as a powerful tool for a wide range of language generation tasks. Nevertheless, this progress has been slower in Arabic. In this work, we focus on the task of generating stories from LLMs. For our training, we use stories acquired through machine translation (MT) as well as GPT-4. For the MT data, we develop a careful pipeline that ensures we acquire high-quality stories. For our GPT-4 data, we introduce crafted prompts that allow us to generate data well-suited to the Arabic context in both Modern Standard Arabic (MSA) and two Arabic dialects (Egyptian and Moroccan). For example, we generate stories tailored to various Arab countries on a wide range of topics. Our manual evaluation shows that our model fine-tuned on these training datasets can generate coherent stories that adhere to our instructions. We also conduct an extensive automatic and human evaluation comparing our models against state-of-the-art proprietary and open-source models. Our datasets and models will be made publicly available at https://github.com/UBC-NLP/arastories.
Arabic Fine-Grained Entity Recognition
Liqreina, Haneen, Jarrar, Mustafa, Khalilia, Mohammed, El-Shangiti, Ahmed Oumar, Abdul-Mageed, Muhammad
Traditional NER systems are typically trained to recognize coarse-grained entities, and less attention is given to classifying entities into a hierarchy of fine-grained lower-level subtypes. This article aims to advance Arabic NER with fine-grained entities. We chose to extend Wojood (an open-source Nested Arabic Named Entity Corpus) with subtypes. In particular, four main entity types in Wojood, geopolitical entity (GPE), location (LOC), organization (ORG), and facility (FAC), are extended with 31 subtypes. To do this, we first revised Wojood's annotations of GPE, LOC, ORG, and FAC to be compatible with the LDC's ACE guidelines, which yielded 5,614 changes. Second, all mentions of GPE, LOC, ORG, and FAC (~44K) in Wojood were manually annotated with the LDC's ACE subtypes. We refer to this extended version of Wojood as WojoodFine. To evaluate our annotations, we measured the inter-annotator agreement (IAA) using both Cohen's Kappa and F1 score, resulting in 0.9861 and 0.9889, respectively. To compute baselines for WojoodFine, we fine-tuned three pre-trained Arabic BERT encoders in three settings: flat NER, nested NER, and nested NER with subtypes, achieving F1 scores of 0.920, 0.866, and 0.885, respectively. Our corpus and models are open-source and available at https://sina.birzeit.edu/wojood/.
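The IAA measurement above uses a standard statistic. As an illustrative sketch (not the authors' code, and with made-up example labels), Cohen's kappa over two annotators' subtype labels for the same mentions can be computed as:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if the two annotators labeled independently,
    # each according to their own label distribution.
    ca, cb = Counter(labels_a), Counter(labels_b)
    expected = sum(ca[l] * cb[l] for l in set(labels_a) | set(labels_b)) / n**2
    return (observed - expected) / (1 - expected)

# Hypothetical subtype annotations for six mentions by two annotators.
a = ["GPE", "GPE", "ORG", "LOC", "FAC", "GPE"]
b = ["GPE", "GPE", "ORG", "LOC", "GPE", "GPE"]
print(round(cohens_kappa(a, b), 3))
```

Kappa discounts the agreement two annotators would reach by chance, which is why it is reported alongside raw-agreement-style measures such as F1.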
TARJAMAT: Evaluation of Bard and ChatGPT on Machine Translation of Ten Arabic Varieties
Kadaoui, Karima, Magdy, Samar M., Waheed, Abdul, Khondaker, Md Tawkat Islam, El-Shangiti, Ahmed Oumar, Nagoudi, El Moatez Billah, Abdul-Mageed, Muhammad
Despite the purported multilingual proficiency of instruction-finetuned large language models (LLMs) such as ChatGPT and Bard, the linguistic inclusivity of these models remains insufficiently explored. Considering this constraint, we present a thorough assessment of Bard and ChatGPT (encompassing both GPT-3.5 and GPT-4) regarding their machine translation proficiencies across ten varieties of Arabic. Our evaluation covers diverse Arabic varieties such as Classical Arabic (CA), Modern Standard Arabic (MSA), and several country-level dialectal variants. Our analysis indicates that LLMs may encounter challenges with dialects for which minimal public datasets exist, but on average are better translators of dialects than existing commercial systems. On CA and MSA, instruction-tuned LLMs, however, trail behind commercial systems such as Google Translate. Finally, we undertake a human-centric study to scrutinize the efficacy of the relatively recent model, Bard, in following human instructions during translation tasks. Our analysis reveals a circumscribed capability of Bard in aligning with human instructions in translation contexts. Collectively, our findings underscore that prevailing LLMs remain far from inclusive, with only limited ability to cater for the linguistic and cultural intricacies of diverse communities.
QCRI at SemEval-2023 Task 3: News Genre, Framing and Persuasion Techniques Detection using Multilingual Models
Hasanain, Maram, El-Shangiti, Ahmed Oumar, Nandi, Rabindra Nath, Nakov, Preslav, Alam, Firoj
Misinformation spreading in mainstream and social media misleads users in a variety of ways. Manual detection and verification efforts by journalists and fact-checkers can no longer cope with the great scale and quick spread of misleading information. This has motivated research and industry efforts to develop systems for analyzing and verifying news spreading online. SemEval-2023 Task 3 is an attempt to address several subtasks under this overarching problem, targeting writing techniques used in news articles to affect readers' opinions. The task addressed three subtasks in six languages, in addition to three ``surprise'' test languages, resulting in 27 different test setups. This paper describes our participating system for this task. Our team is one of only six teams that successfully submitted runs for all setups. The official results show that our system ranked among the top three systems for 10 of the 27 setups.