Tunisian dialect


How Well Do LLMs Understand Tunisian Arabic?

arXiv.org Artificial Intelligence

Large Language Models (LLMs) are the engines driving today's AI agents. The better these models understand human languages, the more natural and user-friendly the interaction with AI becomes, from everyday devices like computers and smartwatches to any tool that can act intelligently. Yet, the ability of industrial-scale LLMs to comprehend low-resource languages, such as Tunisian Arabic (Tunizi), is often overlooked. This neglect risks excluding millions of Tunisians from fully interacting with AI in their own language, pushing them toward French or English. Such a shift not only threatens the preservation of the Tunisian dialect but may also create challenges for literacy and influence younger generations to favor foreign languages. In this study, we introduce a novel dataset containing parallel Tunizi, standard Tunisian Arabic, and English translations, along with sentiment labels. We benchmark several popular LLMs on three tasks: transliteration, translation, and sentiment analysis. Our results reveal significant differences between models, highlighting both their strengths and limitations in understanding and processing Tunisian dialects. By quantifying these gaps, this work underscores the importance of including low-resource languages in the next generation of AI systems, ensuring technology remains accessible, inclusive, and culturally grounded.


TEDxTN: A Three-way Speech Translation Corpus for Code-Switched Tunisian Arabic - English

arXiv.org Artificial Intelligence

In this paper, we introduce TEDxTN, the first publicly available Tunisian Arabic to English speech translation dataset. This work is in line with the ongoing effort to mitigate the data scarcity obstacle for a number of Arabic dialects. We collected, segmented, transcribed, and translated 108 TEDx talks following our internally developed annotation guidelines. The collected talks represent 25 hours of code-switched speech, covering speakers with various accents from over 11 different regions of Tunisia. We make the annotation guidelines and corpus publicly available. This will enable the extension of TEDxTN to new talks as they become available. We also report results for strong baseline systems of Speech Recognition and Speech Translation using multiple pre-trained and fine-tuned end-to-end models. This corpus is the first open-source and publicly available speech translation corpus of code-switched Tunisian dialect. We believe that this is a valuable resource that can motivate and facilitate further research on the natural language processing of Tunisian dialect.


Machine Translation: Tunisian Dialect -- English is it possible?

#artificialintelligence

In the context of a final Specialty project, we were given a month to develop an idea that demonstrates our newly acquired knowledge and techniques. Choosing Machine Learning among all the available specialties was a risky step that we embarked on and, mostly, enjoyed. Having covered its mathematical and statistical concepts, supervised learning, unsupervised learning, and reinforcement learning, we had a wide range of sub-fields to work on, but we considered two factors: what time allowed us to do and what area we felt most able to work on. That is how our choice was directed primarily to Natural Language Processing. Then we had to decide what exactly to do. Last year we worked on a bedtime stories application that collected Tunisian folkloric stories, and a member of the presentation jury asked us whether it would be possible to implement a text-to-speech feature.