ADI-20: Arabic Dialect Identification dataset and models
Elleuch, Haroun, Mdhaffar, Salima, Estève, Yannick, Bougares, Fethi
arXiv.org Artificial Intelligence
We present ADI-20, an extension of the previously published ADI-17 Arabic Dialect Identification (ADI) dataset. ADI-20 covers the dialects of all Arabic-speaking countries. It comprises 3,556 hours of speech from 19 Arabic dialects in addition to Modern Standard Arabic (MSA). We used this dataset to train and evaluate several state-of-the-art ADI systems. We explored fine-tuning pre-trained ECAPA-TDNN-based models, as well as Whisper encoder blocks coupled with an attention pooling layer and a dense classification layer. We investigated the effect of (i) training data size and (ii) the number of model parameters on identification performance. Our results show only a small decrease in F1 score when training on just 30% of the original training data. We open-source the collected data and trained models to enable reproduction of our work and to support further research in ADI.
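As a rough illustration of the classification head described in the abstract (attention pooling over encoder frames followed by a dense layer), here is a minimal pure-Python sketch. It is not the authors' implementation: the single-vector scoring form of the attention, the dimensions `T` and `D`, the names `attention_pool` and `dense_classify`, and the random numbers standing in for Whisper encoder outputs and trained weights are all assumptions made for illustration. Only the 20-way output (19 dialects plus MSA) comes from the abstract.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frames, w):
    """Collapse T frame vectors (each of size D) into one D-sized vector.

    Each frame gets a scalar score (dot product with w); a softmax over
    time turns the scores into weights for a weighted average of frames.
    """
    scores = [sum(wi * xi for wi, xi in zip(w, frame)) for frame in frames]
    weights = softmax(scores)
    dim = len(frames[0])
    pooled = [sum(a * frame[d] for a, frame in zip(weights, frames))
              for d in range(dim)]
    return pooled, weights


def dense_classify(pooled, weight_rows, bias):
    """Dense layer plus softmax: one probability per class."""
    logits = [sum(wi * xi for wi, xi in zip(row, pooled)) + b
              for row, b in zip(weight_rows, bias)]
    return softmax(logits)


NUM_CLASSES = 20   # 19 dialects + MSA, as stated in the abstract
T, D = 50, 8       # hypothetical number of frames and feature size

rng = random.Random(0)
# Random stand-ins for encoder outputs and for trained parameters.
frames = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(T)]
attn_w = [rng.gauss(0, 1) for _ in range(D)]
cls_w = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(NUM_CLASSES)]
cls_b = [0.0] * NUM_CLASSES

pooled, weights = attention_pool(frames, attn_w)
probs = dense_classify(pooled, cls_w, cls_b)
predicted = max(range(NUM_CLASSES), key=probs.__getitem__)
```

In a real system the frames would come from the fine-tuned encoder and the pooling and classifier weights would be learned jointly; the sketch only shows how a variable-length utterance is reduced to a fixed-size vector before classification.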
Nov-14-2025
- Country:
- Africa > Middle East (0.14)
- Genre:
- Research Report > New Finding (1.00)
- Technology:
- Information Technology > Artificial Intelligence
- Speech > Speech Recognition (1.00)
- Natural Language (1.00)
- Machine Learning (1.00)