Kheir, Yassine El
MorphBPE: A Morpho-Aware Tokenizer Bridging Linguistic Complexity for Efficient LLM Training Across Morphologies
Asgari, Ehsaneddin, Kheir, Yassine El, Javaheri, Mohammad Ali Sadraei
Tokenization is fundamental to Natural Language Processing (NLP), directly impacting model efficiency and linguistic fidelity. While Byte Pair Encoding (BPE) is widely used in Large Language Models (LLMs), it often disregards morpheme boundaries, leading to suboptimal segmentation, particularly in morphologically rich languages. We introduce MorphBPE, a morphology-aware extension of BPE that integrates linguistic structure into subword tokenization while preserving statistical efficiency. Additionally, we propose two morphology-based evaluation metrics: (i) Morphological Consistency F1-Score, which quantifies the consistency between morpheme sharing and token sharing and contributes to LLM training convergence, and (ii) Morphological Edit Distance, which measures the alignment between morphemes and tokens with respect to interpretability. Experiments on English, Russian, Hungarian, and Arabic with 300M- and 1B-parameter LLMs demonstrate that MorphBPE consistently reduces cross-entropy loss, accelerates convergence, and improves morphological alignment scores. Fully compatible with existing LLM pipelines, MorphBPE requires minimal modifications for integration. The MorphBPE codebase and tokenizer playground will be available at: https://github.com/llm-lab-org/MorphBPE and https://tokenizer.llm-lab.org
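As a rough illustration of the second metric, the sketch below (not the paper's reference implementation; the segment-level formulation is an assumption) computes a Levenshtein distance between a word's gold morpheme segmentation and the segments produced by a tokenizer, treating each segment as one symbol; a lower distance means the tokenization tracks the morphology more closely.

def edit_distance(a, b):
    """Standard Levenshtein distance over two sequences of segments."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i
    for j in range(len(b) + 1):
        dp[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution / match
    return dp[len(a)][len(b)]

# Hypothetical example: gold morphemes vs. BPE tokens for "unhappiness".
gold_morphemes = ["un", "happi", "ness"]
bpe_tokens = ["unh", "app", "iness"]
print(edit_distance(gold_morphemes, bpe_tokens))  # 3: no segments align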
Fanar: An Arabic-Centric Multimodal Generative AI Platform
Fanar Team, Abbas, Ummar, Ahmad, Mohammad Shahmeer, Alam, Firoj, Altinisik, Enes, Asgari, Ehsaneddin, Boshmaf, Yazan, Boughorbel, Sabri, Chawla, Sanjay, Chowdhury, Shammur, Dalvi, Fahim, Darwish, Kareem, Durrani, Nadir, Elfeky, Mohamed, Elmagarmid, Ahmed, Eltabakh, Mohamed, Fatehkia, Masoomali, Fragkopoulos, Anastasios, Hasanain, Maram, Hawasly, Majd, Husaini, Mus'ab, Jung, Soon-Gyo, Lucas, Ji Kim, Magdy, Walid, Messaoud, Safa, Mohamed, Abubakr, Mohiuddin, Tasnim, Mousi, Basel, Mubarak, Hamdy, Musleh, Ahmad, Naeem, Zan, Ouzzani, Mourad, Popovic, Dorde, Sadeghi, Amin, Sencar, Husrev Taha, Shinoy, Mohammed, Sinan, Omar, Zhang, Yifan, Ali, Ahmed, Kheir, Yassine El, Ma, Xiaosong, Ruan, Chaoyi
We present Fanar, a platform for Arabic-centric multimodal generative AI systems that supports language, speech, and image generation tasks. At the heart of Fanar are Fanar Star and Fanar Prime, two highly capable Arabic Large Language Models (LLMs) that are best in class on well-established benchmarks for similarly sized models. Fanar Star is a 7B (billion) parameter model trained from scratch on nearly 1 trillion clean and deduplicated Arabic, English, and code tokens. Fanar Prime is a 9B parameter model continually trained from the Gemma-2 9B base model on the same 1 trillion token set. Both models are concurrently deployed and address different types of prompts, which are transparently routed through a custom-built orchestrator. The Fanar platform provides many other capabilities, including a customized Islamic Retrieval Augmented Generation (RAG) system for handling religious prompts and a Recency RAG for summarizing information about current or recent events that occurred after the pre-training data cut-off date. The platform also offers additional cognitive capabilities, including in-house bilingual speech recognition that supports multiple Arabic dialects, as well as voice and image generation fine-tuned to better reflect regional characteristics. Finally, Fanar provides an attribution service that can be used to verify the authenticity of fact-based generated content. The design, development, and implementation of Fanar were entirely undertaken at Hamad Bin Khalifa University's Qatar Computing Research Institute (QCRI) and sponsored by Qatar's Ministry of Communications and Information Technology to enable sovereign AI technology development.
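Purely as an illustration of the orchestration idea (the abstract does not describe the actual routing logic, so the rules, backend names, and thresholds below are assumptions), a minimal prompt router might look like this:

def route(prompt: str) -> str:
    """Return the name of a hypothetical backend for a given prompt."""
    lowered = prompt.lower()
    if any(k in lowered for k in ("quran", "hadith", "fatwa")):
        return "islamic_rag"      # customized Islamic RAG system
    if any(k in lowered for k in ("today", "latest", "news")):
        return "recency_rag"      # RAG for events after the data cut-off
    if len(lowered.split()) > 200:
        return "fanar_prime"      # larger 9B model for long, complex prompts
    return "fanar_star"           # default 7B from-scratch model

print(route("What is the latest news about the World Cup?"))  # recency_rag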
CAFE: A Novel Code-Switching Dataset for Algerian Dialect, French, and English
Lachemat, Houssam Eddine-Othman, Abbas, Akli, Oukas, Nourredine, Kheir, Yassine El, Haboussi, Samia, Chowdhury, Shammur Absar
The paper introduces and publicly releases (data download link available after acceptance) CAFE -- the first code-switching dataset between the Algerian dialect, French, and English. The CAFE speech data is unique in that (a) it captures spontaneous, in vivo human-human conversation, including phenomena such as code-switching and overlapping speech; (b) it addresses distinct linguistic challenges of the North African Arabic dialect; and (c) it covers dialectal variations from various parts of Algeria within different sociolinguistic contexts. CAFE contains approximately 37 hours of speech, with a subset, CAFE-small, of 2 hours and 36 minutes released with manual human annotation, including speech segmentation, transcription, explicit annotation of code-switching points, overlapping speech, and other events such as noise and laughter. The remaining approximately 34.58 hours contain pseudo-label transcriptions. In addition to the data release, the paper highlights the challenges of using state-of-the-art Automatic Speech Recognition (ASR) models such as Whisper large-v2/v3 and PromptingWhisper to handle such content. We then benchmark CAFE with the aforementioned Whisper models and show how well-designed data processing pipelines and advanced decoding techniques can improve ASR performance, reaching a Mixed Error Rate (MER) of 0.310, a Character Error Rate (CER) of 0.329, and a Word Error Rate (WER) of 0.538.
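For readers who want to reproduce a baseline pass over such data, a minimal sketch with the openai-whisper package is shown below; the file name is hypothetical and the paper's exact decoding configuration is not reflected here.

import whisper  # the openai-whisper package

# Load the large-v2 checkpoint and transcribe a hypothetical CAFE segment.
# For code-switched speech, the language hint can be omitted so the model
# detects the language on its own.
model = whisper.load_model("large-v2")
result = model.transcribe("cafe_segment_001.wav")
print(result["text"])

# PromptingWhisper-style runs additionally bias decoding with a text prompt,
# e.g. model.transcribe("cafe_segment_001.wav", initial_prompt="...");
# the prompts used in the paper are not specified here.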
Automatic Pronunciation Assessment -- A Review
Kheir, Yassine El, Ali, Ahmed, Chowdhury, Shammur Absar
Pronunciation assessment and its application in computer-aided pronunciation training (CAPT) have seen impressive progress in recent years. With the rapid growth in language processing and deep learning over the past few years, there is a need for an updated review. In this paper, we review methods employed in pronunciation assessment for both phonemic and prosodic aspects. We categorize the main challenges observed in prominent research trends and highlight existing limitations and available resources. This is followed by a discussion of the remaining challenges and possible directions for future work.
L1-aware Multilingual Mispronunciation Detection Framework
Kheir, Yassine El, Chowdhury, Shammur Absar, Ali, Ahmed
The phonological discrepancies between a speaker's native (L1) and non-native (L2) language serve as a major factor in mispronunciation. This paper introduces a novel multilingual mispronunciation detection and diagnosis (MDD) architecture, L1-MultiMDD, enriched with L1-aware speech representation. An end-to-end speech encoder is trained on the input signal and its corresponding reference phoneme sequence. First, an attention mechanism is deployed to align the input audio with the reference phoneme sequence. Afterwards, L1-L2 speech embeddings are extracted from an auxiliary model, pretrained in a multi-task setup to identify the L1 and L2 language, and are infused into the primary network. Finally, L1-MultiMDD is optimized for a unified multilingual phoneme recognition task using connectionist temporal classification (CTC) loss for the target languages: English, Arabic, and Mandarin. Our experiments demonstrate the effectiveness of the proposed L1-MultiMDD framework on both seen (L2-ARTIC, LATIC, and AraVoiceL2v2) and unseen (EpaDB and Speechocean762) datasets. The consistent gains in phoneme error rate (PER) and false rejection rate (FRR) across all target languages confirm our approach's robustness, efficacy, and generalizability.
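The sketch below is a schematic PyTorch rendering of the pipeline described above (speech encoding, attention alignment with the reference phonemes, infusion of an L1/L2 embedding, and CTC training); module choices, dimensions, and names are assumptions rather than the authors' implementation.

import torch
import torch.nn as nn

class L1AwareMDD(nn.Module):
    def __init__(self, feat_dim=80, hidden=256, n_phonemes=100, l1_dim=64):
        super().__init__()
        self.speech_enc = nn.GRU(feat_dim, hidden, batch_first=True,
                                 bidirectional=True)
        self.phone_emb = nn.Embedding(n_phonemes, 2 * hidden)
        self.attn = nn.MultiheadAttention(2 * hidden, num_heads=4,
                                          batch_first=True)
        self.fuse = nn.Linear(2 * hidden + l1_dim, 2 * hidden)
        self.head = nn.Linear(2 * hidden, n_phonemes + 1)  # +1 for CTC blank
        self.ctc = nn.CTCLoss(blank=n_phonemes, zero_infinity=True)

    def forward(self, feats, ref_phones, l1_emb, feat_lens, targets, target_lens):
        enc, _ = self.speech_enc(feats)                # (B, T, 2H) speech encoding
        ref = self.phone_emb(ref_phones)               # (B, L, 2H) reference phones
        aligned, _ = self.attn(enc, ref, ref)          # audio frames attend to phones
        l1 = l1_emb.unsqueeze(1).expand(-1, aligned.size(1), -1)
        fused = torch.tanh(self.fuse(torch.cat([aligned, l1], dim=-1)))
        log_probs = self.head(fused).log_softmax(-1)   # (B, T, n_phonemes + 1)
        return self.ctc(log_probs.transpose(0, 1), targets, feat_lens, target_lens)

# Illustrative usage with random tensors standing in for real features.
model = L1AwareMDD()
feats = torch.randn(2, 200, 80)                        # 2 utterances, 200 frames
ref_phones = torch.randint(0, 100, (2, 30))            # reference phoneme sequences
l1_emb = torch.randn(2, 64)                            # from the auxiliary L1/L2 model
loss = model(feats, ref_phones, l1_emb,
             feat_lens=torch.tensor([200, 200]),
             targets=torch.randint(0, 100, (2, 30)),   # annotated spoken phonemes
             target_lens=torch.tensor([30, 30]))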
The complementary roles of non-verbal cues for Robust Pronunciation Assessment
Kheir, Yassine El, Chowdhury, Shammur Absar, Ali, Ahmed
Research on pronunciation assessment systems focuses on utilizing phonetic and phonological aspects of non-native (L2) speech, often neglecting the rich layer of information hidden within non-verbal cues. In this study, we proposed a novel pronunciation assessment framework, IntraVerbalPA.
MyVoice: Arabic Speech Resource Collaboration Platform
Elshahawy, Yousseif, Kheir, Yassine El, Chowdhury, Shammur Absar, Ali, Ahmed
We introduce MyVoice, a crowdsourcing platform designed to collect Arabic speech to enhance dialectal speech technologies. The platform offers an opportunity to design large dialectal speech datasets and make them publicly available. MyVoice allows contributors to select a fine-grained, city/country-level dialect and record the displayed utterances. Users can switch roles between contributor and annotator. The platform incorporates a quality assurance system that filters out low-quality and spurious recordings before sending them for validation. During the validation phase, contributors can assess the quality of recordings, annotate them, and provide feedback, which is then reviewed by administrators. Furthermore, the platform gives admin roles the flexibility to add new data or tasks beyond dialectal speech and word collection, which are displayed to contributors, thus enabling collaborative efforts in gathering diverse and large-scale Arabic speech data.
SpeechBlender: Speech Augmentation Framework for Mispronunciation Data Generation
Kheir, Yassine El, Chowdhury, Shammur Absar, Ali, Ahmed, Mubarak, Hamdy, Afzal, Shazia
The lack of labeled second-language (L2) speech data is a major challenge in designing mispronunciation detection models. We introduce SpeechBlender, a fine-grained data augmentation pipeline for generating mispronunciation errors to overcome this data scarcity. SpeechBlender utilizes a variety of masks to target different regions of phonetic units and uses mixing factors to linearly interpolate raw speech signals while augmenting pronunciation. The masks facilitate smooth blending of the signals, generating more effective samples than the `Cut/Paste' method. Our proposed technique achieves state-of-the-art results on Speechocean762 for ASR-dependent mispronunciation detection models at the phoneme level, with a 2.0% gain in Pearson Correlation Coefficient (PCC) compared to the previous state-of-the-art [1]. Additionally, we demonstrate a 5.0% improvement at the phoneme level compared to our baseline. We also observe a 4.6% increase in F1-score on the Arabic AraVoiceL2 test set.
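A minimal sketch of the blending idea follows: two raw speech signals are linearly interpolated inside a masked region with a mixing factor, and the mask edges are ramped so the transition is smooth. The mask shape, ramp length, and mixing factor below are illustrative assumptions, not the paper's exact configuration.

import numpy as np

def blend(good, bad, start, end, alpha=0.5, ramp=160):
    """Mix `bad` into `good` between samples [start, end) with weight alpha."""
    assert good.shape == bad.shape
    mask = np.zeros_like(good)
    mask[start:end] = alpha
    # Smooth the mask edges so the blended region fades in and out.
    mask[start:start + ramp] = np.linspace(0.0, alpha, ramp)
    mask[end - ramp:end] = np.linspace(alpha, 0.0, ramp)
    return (1.0 - mask) * good + mask * bad

# Hypothetical use: replace a phone region of a correct utterance with a
# blend of a mispronounced counterpart.
sr = 16000
good = np.random.randn(sr)   # stand-in for a correctly pronounced segment
bad = np.random.randn(sr)    # stand-in for a mispronounced segment
augmented = blend(good, bad, start=4000, end=8000, alpha=0.6)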
Benchmarking Arabic AI with Large Language Models
Abdelali, Ahmed, Mubarak, Hamdy, Chowdhury, Shammur Absar, Hasanain, Maram, Mousi, Basel, Boughorbel, Sabri, Kheir, Yassine El, Izham, Daniel, Dalvi, Fahim, Hawasly, Majd, Nazar, Nizi, Elshahawy, Yousseif, Ali, Ahmed, Durrani, Nadir, Milic-Frayling, Natasa, Alam, Firoj
With large Foundation Models (FMs), language technologies (AI in general) are entering a new paradigm: eliminating the need for developing large-scale task-specific datasets and supporting a variety of tasks through setups ranging from zero-shot to few-shot learning. However, understanding FMs' capabilities requires a systematic benchmarking effort that compares FM performance with state-of-the-art (SOTA) task-specific models. With that goal, past work has focused on the English language, with only a few efforts covering multiple languages. Our study contributes to ongoing research by evaluating FM performance on standard Arabic NLP and speech processing, covering a range of tasks from sequence tagging to content classification across diverse domains. We start with zero-shot learning using GPT-3.5-turbo, Whisper, and USM, addressing 33 unique tasks using 59 publicly available datasets, resulting in 96 test setups. For a few tasks, FMs perform on par with or exceed the SOTA models, but for the majority they under-perform. Given the importance of prompts for FM performance, we discuss our prompt strategies in detail and elaborate on our findings. Our future work on Arabic AI will explore few-shot prompting, expand the range of tasks, and investigate additional open-source models.
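The zero-shot setup can be pictured with the short sketch below, which sends one Arabic task instance to gpt-3.5-turbo and reads back a label; the prompt wording, label set, and example sentence are assumptions, not the study's exact prompts.

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def zero_shot_sentiment(text: str) -> str:
    # Task-specific instruction followed by the input instance.
    prompt = (
        "Classify the sentiment of the following Arabic sentence as "
        "Positive, Negative, or Neutral. Answer with a single word.\n\n"
        f"Sentence: {text}"
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()

print(zero_shot_sentiment("الخدمة كانت ممتازة والطعام لذيذ"))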
QVoice: Arabic Speech Pronunciation Learning Application
Kheir, Yassine El, Khnaisser, Fouad, Chowdhury, Shammur Absar, Mubarak, Hamdy, Afzal, Shazia, Ali, Ahmed
This paper introduces QVoice, a novel Arabic pronunciation learning application powered by an end-to-end mispronunciation detection and feedback generation module. The application is designed to support non-native Arabic speakers in enhancing their pronunciation skills, while also helping native speakers mitigate any potential influence of regional dialects on their Modern Standard Arabic (MSA) pronunciation. QVoice employs various learning cues to aid learners in comprehending meaning and drawing connections with their existing knowledge of the English language, and it offers detailed feedback for pronunciation correction, along with contextual examples showcasing word usage. The learning cues featured in QVoice encompass a wide range of meaningful information, such as visualizations of phrases/words and their translations, as well as phonetic transcriptions and transliterations. QVoice provides pronunciation feedback at the character level and assesses performance at the word level.