Goto

Collaborating Authors

 Sabty, Caroline


A Survey of Code-switched Arabic NLP: Progress, Challenges, and Future Directions

arXiv.org Artificial Intelligence

Language in the Arab world presents a complex diglossic and multilingual setting, involving the use of Modern Standard Arabic, various dialects and sub-dialects, as well as multiple European languages. This diverse linguistic landscape has given rise to code-switching, both within Arabic varieties and between Arabic and foreign languages. The widespread occurrence of code-switching across the region makes it vital to address these linguistic needs when developing language technologies. In this paper, we provide a review of the current literature in the field of code-switched Arabic NLP, offering a broad perspective on ongoing efforts, challenges, research gaps, and recommendations for future research directions.


Computational Approaches to Arabic-English Code-Switching

arXiv.org Artificial Intelligence

Natural Language Processing (NLP) is a vital computational method for addressing language processing, analysis, and generation. NLP tasks form the core of many daily applications, from automatic text correction to speech recognition. While significant research has focused on NLP tasks for the English language, less attention has been given to Modern Standard Arabic and Dialectal Arabic. Globalization has also contributed to the rise of Code-Switching (CS), where speakers mix languages within conversations and even within individual words (intra-word CS). This is especially common in Arab countries, where people often switch between dialects or between dialects and a foreign language they master. CS between Arabic and English is frequent in Egypt, especially on social media. Consequently, a significant amount of code-switched content can be found online. Such code-switched data needs to be investigated and analyzed for several NLP tasks to tackle the challenges of this multilingual phenomenon and Arabic language challenges. No work has been done before for several integral NLP tasks on Arabic-English CS data. In this work, we focus on the Named Entity Recognition (NER) task and other tasks that help propose a solution for the NER task on CS data, e.g., Language Identification. This work addresses this gap by proposing and applying state-of-the-art techniques for Modern Standard Arabic and Arabic-English NER. We have created the first annotated CS Arabic-English corpus for the NER task. Also, we apply two enhancement techniques to improve the NER tagger on CS data using CS contextual embeddings and data augmentation techniques. All methods showed improvements in the performance of the NER taggers on CS data. Finally, we propose several intra-word language identification approaches to determine the language type of a mixed text and identify whether it is a named entity or not.