AITopics | arabic dialect identification

Collaborating Authors

arabic dialect identification

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

ELYADATA & LIA at NADI 2025: ASR and ADI Subtasks

Elleuch, Haroun, Saidi, Youssef, Mdhaffar, Salima, Estève, Yannick, Bougares, Fethi

arXiv.org Artificial IntelligenceNov-14-2025

This paper describes Elyadata \& LIA's joint submission to the NADI multi-dialectal Arabic Speech Processing 2025. We participated in the Spoken Arabic Dialect Identification (ADI) and multi-dialectal Arabic ASR subtasks. Our submission ranked first for the ADI subtask and second for the multi-dialectal Arabic ASR subtask among all participants. Our ADI system is a fine-tuned Whisper-large-v3 encoder with data augmentation. This system obtained the highest ADI accuracy score of \textbf{79.83\%} on the official test set. For multi-dialectal Arabic ASR, we fine-tuned SeamlessM4T-v2 Large (Egyptian variant) separately for each of the eight considered dialects. Overall, we obtained an average WER and CER of \textbf{38.54\%} and \textbf{14.53\%}, respectively, on the test set. Our results demonstrate the effectiveness of large pre-trained speech models with targeted fine-tuning for Arabic speech processing.

artificial intelligence, machine learning, natural language, (15 more...)

arXiv.org Artificial Intelligence

doi: 10.18653/v1/2025.arabicnlp-sharedtasks.105

2511.1009

Country:

Europe (1.00)
Africa > Middle East > Morocco (0.15)
Asia > Middle East > Republic of Türkiye (0.14)

Genre: Research Report > New Finding (0.86)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.50)

Add feedback

DialectalArabicMMLU: Benchmarking Dialectal Capabilities in Arabic and Multilingual Language Models

Altakrori, Malik H., Habash, Nizar, Freihat, Abdelhakim, Samih, Younes, Chirkunov, Kirill, AbuOdeh, Muhammed, Florian, Radu, Lynn, Teresa, Nakov, Preslav, Aji, Alham Fikri

arXiv.org Artificial IntelligenceNov-3-2025

We present DialectalArabicMMLU, a new benchmark for evaluating the performance of large language models (LLMs) across Arabic dialects. While recently developed Arabic and multilingual benchmarks have advanced LLM evaluation for Modern Standard Arabic (MSA), dialectal varieties remain underrepresented despite their prevalence in everyday communication. DialectalArabicMMLU extends the MMLU-Redux framework through manual translation and adaptation of 3K multiple-choice question-answer pairs into five major dialects (Syrian, Egyptian, Emirati, Saudi, and Moroccan), yielding a total of 15K QA pairs across 32 academic and professional domains (22K QA pairs when also including English and MSA). The benchmark enables systematic assessment of LLM reasoning and comprehension beyond MSA, supporting both task-based and linguistic analysis. We evaluate 19 open-weight Arabic and multilingual LLMs (1B-13B parameters) and report substantial performance variation across dialects, revealing persistent gaps in dialectal generalization. DialectalArabicMMLU provides the first unified, human-curated resource for measuring dialectal understanding in Arabic, thus promoting more inclusive evaluation and future model development.

artificial intelligence, large language model, natural language, (17 more...)

arXiv.org Artificial Intelligence

2510.27543

Country:

Europe (1.00)
Africa (0.68)
Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)

Genre:

Research Report > Experimental Study (0.68)
Research Report > New Finding (0.46)

Industry: Education > Curriculum (0.46)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

Exploring Data and Parameter Efficient Strategies for Arabic Dialect Identifications

Kanjirangat, Vani, Dolamic, Ljiljana, Rinaldi, Fabio

arXiv.org Artificial IntelligenceSep-19-2025

This paper discusses our exploration of different data-efficient and parameter-efficient approaches to Arabic Dialect Identification (ADI). In particular, we investigate various soft-prompting strategies, including prefix-tuning, prompt-tuning, P-tuning, and P-tuning V2, as well as LoRA reparameterizations. For the data-efficient strategy, we analyze hard prompting with zero-shot and few-shot inferences to analyze the dialect identification capabilities of Large Language Models (LLMs). For the parameter-efficient PEFT approaches, we conducted our experiments using Arabic-specific encoder models on several major datasets. We also analyzed the n-shot inferences on open-source decoder-only models, a general multilingual model (Phi-3.5), and an Arabic-specific one(SILMA). We observed that the LLMs generally struggle to differentiate the dialectal nuances in the few-shot or zero-shot setups. The soft-prompted encoder variants perform better, while the LoRA-based fine-tuned models perform best, even surpassing full fine-tuning.

large language model, machine learning, natural language, (13 more...)

arXiv.org Artificial Intelligence

2509.13775

Country:

Africa > Middle East (0.68)
Asia > Middle East > UAE (0.28)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

The Arabic Generality Score: Another Dimension of Modeling Arabic Dialectness

Shaban, Sanad, Habash, Nizar

arXiv.org Artificial IntelligenceAug-26-2025

Arabic dialects form a diverse continuum, yet NLP models often treat them as discrete categories. Recent work addresses this issue by modeling dialectness as a continuous variable, notably through the Arabic Level of Dialectness (ALDi). However, ALDi reduces complex variation to a single dimension. We propose a complementary measure: the Arabic Generality Score (AGS), which quantifies how widely a word is used across dialects. We introduce a pipeline that combines word alignment, etymology-aware edit distance, and smoothing to annotate a parallel corpus with word-level AGS. A regression model is then trained to predict AGS in context. Our approach outperforms strong baselines, including state-of-the-art dialect ID systems, on a multi-dialect benchmark. AGS offers a scalable, linguistically grounded way to model lexical generality, enriching representations of Arabic dialectness.

artificial intelligence, machine learning, natural language, (14 more...)

arXiv.org Artificial Intelligence

2508.17347

Country:

North America > United States (1.00)
Europe (1.00)
Asia (1.00)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.34)

Add feedback

Arabic Dialect Classification using RNNs, Transformers, and Large Language Models: A Comparative Analysis

Essameldin, Omar A., Elbeih, Ali O., Gomaa, Wael H., Elsersy, Wael F.

arXiv.org Artificial IntelligenceJul-1-2025

--The Arabic language is among the most popular languages in the world with a huge variety of dialects spoken in 22 countries. In this study, we address the problem of classifying 18 Arabic dialects of the QADI dataset of Arabic tweets. RNN models, Transformer models, and large language models (LLMs) via prompt engineering are created and tested. Among these, MARBERTv2 performed best with 65% accuracy and 64% F1-score. Through the use of state-of-the-art preprocessing techniques and the latest NLP models, this paper identifies the most significant linguistic issues in Arabic dialect identification. The results corroborate applications like personalized chatbots that respond in users' dialects, social media monitoring, and greater accessibility for Arabic communities.

f1-score, large language model, machine learning, (15 more...)

arXiv.org Artificial Intelligence

2506.19753

Country: Africa > Middle East > Egypt (0.29)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Revisiting Common Assumptions about Arabic Dialects in NLP

Keleg, Amr, Goldwater, Sharon, Magdy, Walid

arXiv.org Artificial IntelligenceMay-29-2025

Arabic has diverse dialects, where one dialect can be substantially different from the others. In the NLP literature, some assumptions about these dialects are widely adopted (e.g., ``Arabic dialects can be grouped into distinguishable regional dialects") and are manifested in different computational tasks such as Arabic Dialect Identification (ADI). However, these assumptions are not quantitatively verified. We identify four of these assumptions and examine them by extending and analyzing a multi-label dataset, where the validity of each sentence in 11 different country-level dialects is manually assessed by speakers of these dialects. Our analysis indicates that the four assumptions oversimplify reality, and some of them are not always accurate. This in turn might be hindering further progress in different Arabic NLP tasks.

artificial intelligence, machine learning, natural language, (15 more...)

arXiv.org Artificial Intelligence

2505.21816

Country:

North America > United States (1.00)
Europe (1.00)
Africa > Middle East (1.00)
Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)

Genre:

Research Report > New Finding (0.46)
Research Report > Experimental Study (0.46)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

dzNLP at NADI 2024 Shared Task: Multi-Classifier Ensemble with Weighted Voting and TF-IDF Features

Lichouri, Mohamed, Lounnas, Khaled, Zahaf, Boualem Nadjib, Rabiai, Mehdi Ayoub

arXiv.org Artificial IntelligenceJul-18-2024

This paper presents the contribution of our dzNLP team to the NADI 2024 shared task, specifically in Subtask 1 - Multi-label Country-level Dialect Identification (MLDID) (Closed Track). We explored various configurations to address the challenge: in Experiment 1, we utilized a union of n-gram analyzers (word, character, character with word boundaries) with different n-gram values; in Experiment 2, we combined a weighted union of Term Frequency-Inverse Document Frequency (TF-IDF) features with various weights; and in Experiment 3, we implemented a weighted major voting scheme using three classifiers: Linear Support Vector Classifier (LSVC), Random Forest (RF), and K-Nearest Neighbors (KNN). Our approach, despite its simplicity and reliance on traditional machine learning techniques, demonstrated competitive performance in terms of F1-score and precision. Notably, we achieved the highest precision score of 63.22% among the participating teams. However, our overall F1 score was approximately 21%, significantly impacted by a low recall rate of 12.87%. This indicates that while our models were highly precise, they struggled to recall a broad range of dialect labels, highlighting a critical area for improvement in handling diverse dialectal variations.

arabic dialect identification, classifier, dialect identification, (12 more...)

arXiv.org Artificial Intelligence

2407.13608

Country:

Africa > Middle East > Algeria > Algiers Province > Algiers (0.06)
Europe > Ukraine > Kyiv Oblast > Kyiv (0.05)
Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.05)
(2 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.80)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Nearest Neighbor Methods (0.54)

Add feedback

USTHB at NADI 2023 shared task: Exploring Preprocessing and Feature Engineering Strategies for Arabic Dialect Identification

Lichouri, Mohamed, Lounnas, Khaled, Zitouni, Aicha, Latrache, Houda, Djeradi, Rachida

arXiv.org Artificial IntelligenceDec-16-2023

In this paper, we conduct an in-depth analysis of several key factors influencing the performance of Arabic Dialect Identification NADI'2023, with a specific focus on the first subtask involving country-level dialect identification. Our investigation encompasses the effects of surface preprocessing, morphological preprocessing, FastText vector model, and the weighted concatenation of TF-IDF features. For classification purposes, we employ the Linear Support Vector Classification (LSVC) model. During the evaluation phase, our system demonstrates noteworthy results, achieving an F1 score of 62.51%. This achievement closely aligns with the average F1 scores attained by other systems submitted for the first subtask, which stands at 72.91%.

dialect identification, identification, proceedings, (10 more...)

arXiv.org Artificial Intelligence

2312.10536

Country:

Africa > Middle East > Algeria > Algiers Province > Algiers (0.06)
Europe > Ukraine > Kyiv Oblast > Kyiv (0.05)
Asia > Middle East > Syria (0.05)
(9 more...)

Genre: Research Report > New Finding (0.47)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Mavericks at NADI 2023 Shared Task: Unravelling Regional Nuances through Dialect Identification using Transformer-based Approach

Deshpande, Vedant, Patwardhan, Yash, Deshpande, Kshitij, Mangalvedhekar, Sudeep, Murumkar, Ravindra

arXiv.org Artificial IntelligenceNov-30-2023

In this paper, we present our approach for the "Nuanced Arabic Dialect Identification (NADI) Shared Task 2023". We highlight our methodology for subtask 1 which deals with country-level dialect identification. Recognizing dialects plays an instrumental role in enhancing the performance of various downstream NLP tasks such as speech recognition and translation. The task uses the Twitter dataset (TWT-2023) that encompasses 18 dialects for the multi-class classification problem. Numerous transformer-based models, pre-trained on Arabic language, are employed for identifying country-level dialects. We fine-tune these state-of-the-art models on the provided dataset. The ensembling method is leveraged to yield improved performance of the system. We achieved an F1-score of 76.65 (11th rank on the leaderboard) on the test dataset.

computational linguistic, dataset, dialect identification, (9 more...)

arXiv.org Artificial Intelligence

2311.18739

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.06)
Europe > Ukraine > Kyiv Oblast > Kyiv (0.05)
(6 more...)

Genre: Research Report > Promising Solution (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.73)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.62)

Add feedback

Arabic Dialect Identification under Scrutiny: Limitations of Single-label Classification

Keleg, Amr, Magdy, Walid

arXiv.org Artificial IntelligenceOct-20-2023

Automatic Arabic Dialect Identification (ADI) of text has gained great popularity since it was introduced in the early 2010s. Multiple datasets were developed, and yearly shared tasks have been running since 2018. However, ADI systems are reported to fail in distinguishing between the micro-dialects of Arabic. We argue that the currently adopted framing of the ADI task as a single-label classification problem is one of the main reasons for that. We highlight the limitation of the incompleteness of the Dialect labels and demonstrate how it impacts the evaluation of ADI systems. A manual error analysis for the predictions of an ADI, performed by 7 native speakers of different Arabic dialects, revealed that $\approx$ 66% of the validated errors are not true errors. Consequently, we propose framing ADI as a multi-label classification task and give recommendations for designing new ADI datasets.

artificial intelligence, machine learning, natural language, (14 more...)

arXiv.org Artificial Intelligence

2310.13661

Country:

Africa > Middle East > Somalia (0.14)
Africa > Middle East > Djibouti (0.14)
Asia > Middle East > Jordan (0.06)
(49 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback