Diacritization


A Rhythm-Aware Phrase Insertion for Classical Arabic Poetry Composition

Elzohbi, Mohamad, Zhao, Richard

arXiv.org Artificial Intelligence

This paper presents a methodology for inserting phrases into Arabic poems so that they conform to a specific rhythm, using ByT5, a byte-level multilingual transformer-based model. Our work discusses a rule-based grapheme-to-beat transformation tailored for extracting the rhythm from fully diacritized Arabic script. Our approach employs a conditional denoising objective to fine-tune ByT5, where the model reconstructs masked words to match a target rhythm. We adopt a curriculum learning strategy, pre-training on a general Arabic dataset before fine-tuning on a poetic dataset, and explore cross-lingual transfer from English to Arabic. Experimental results demonstrate that our models achieve high rhythmic alignment while maintaining semantic coherence. The proposed model has the potential to be used in co-creative applications for composing classical Arabic poems.
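The conditional denoising setup described above lends itself to a simple input/target formatting sketch. The tags and beat alphabet below ("-" for a long beat, "u" for a short one) are illustrative assumptions, not the paper's exact scheme:

```python
# Toy sketch of a conditional denoising example for rhythm-conditioned
# phrase infilling. The <rhythm> and <mask> tags and the beat symbols
# are illustrative assumptions, not the authors' actual format.

def make_denoising_example(words, mask_index, target_rhythm):
    """Mask one word and prepend the rhythm the model must reproduce."""
    masked = [w if i != mask_index else "<mask>" for i, w in enumerate(words)]
    source = f"<rhythm> {target_rhythm} </rhythm> " + " ".join(masked)
    target = words[mask_index]
    return source, target

# The source conditions the model on the beat pattern; the target is the
# word the model must restore so the line scans correctly.
src, tgt = make_denoising_example(["qifA", "nabki", "min", "dhikrA"], 1, "-u-")
```

During fine-tuning, pairs like `(src, tgt)` would be fed to the byte-level encoder-decoder as ordinary text-to-text examples.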


Hebrew Diacritics Restoration using Visual Representation

Elboher, Yair, Pinter, Yuval

arXiv.org Artificial Intelligence

Diacritics restoration in Hebrew is a fundamental task for ensuring accurate word pronunciation and disambiguating textual meaning. Despite the language's high degree of ambiguity when unvocalized, recent machine learning approaches have significantly advanced performance on this task. In this work, we present DIVRIT, a novel system for Hebrew diacritization that frames the task as a zero-shot classification problem. Our approach operates at the word level, selecting the most appropriate diacritization pattern for each undiacritized word from a dynamically generated candidate set, conditioned on the surrounding textual context. A key innovation of DIVRIT is its use of a Hebrew Visual Language Model, which processes undiacritized text as an image, allowing diacritic information to be embedded directly within the input's vector representation. Through a comprehensive evaluation across various configurations, we demonstrate that the system effectively performs diacritization without relying on complex, explicit linguistic analysis. Notably, in an "oracle" setting where the correct diacritized form is guaranteed to be among the provided candidates, DIVRIT achieves a high level of accuracy. Furthermore, strategic architectural enhancements and optimized training methodologies yield significant improvements in the system's overall generalization capabilities. These findings highlight the promising potential of visual representations for accurate and automated Hebrew diacritization.
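The candidate-selection framing can be sketched in a few lines. The scorer below is a stand-in: DIVRIT scores candidates with a Hebrew Visual Language Model over text rendered as an image, which is not reproduced here.

```python
# Toy sketch of diacritization as candidate selection: score each candidate
# pattern for a word in context and keep the best. The scoring function is
# a placeholder for DIVRIT's visual language model.

def choose_diacritization(word, context, candidates, score):
    """Return the candidate diacritization pattern the scorer prefers."""
    return max(candidates, key=lambda cand: score(word, context, cand))

# Illustrative usage with hypothetical precomputed scores for Hebrew "ספר".
candidates = ["סֵפֶר", "סָפַר"]        # "book" vs. "he counted"
scores = {"סֵפֶר": 0.9, "סָפַר": 0.4}  # hypothetical model outputs
best = choose_diacritization("ספר", "קראתי ספר", candidates,
                             lambda w, c, cand: scores[cand])
```

Because the candidate set is generated dynamically per word, the same selection loop works for words never seen at training time, which is what makes the zero-shot framing possible.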


Phonikud: Hebrew Grapheme-to-Phoneme Conversion for Real-Time Text-to-Speech

Kolani, Yakov, Melichov, Maxim, Calev, Cobi, Alper, Morris

arXiv.org Artificial Intelligence

Real-time text-to-speech (TTS) for Modern Hebrew is challenging due to the language's orthographic complexity. Existing solutions ignore crucial phonetic features such as stress that remain underspecified even when vowel marks are added. To address these limitations, we introduce Phonikud, a lightweight, open-source Hebrew grapheme-to-phoneme (G2P) system that outputs fully-specified IPA transcriptions. Our approach adapts an existing diacritization model with lightweight adaptors, incurring negligible additional latency. We also contribute the ILSpeech dataset of transcribed Hebrew speech with IPA annotations, serving as a benchmark for Hebrew G2P, as training data for TTS systems, and enabling audio-to-IPA for evaluating TTS performance while capturing important phonetic details. Our results demonstrate that Phonikud G2P conversion more accurately predicts phonemes from Hebrew text compared to prior methods, and that this enables training of effective real-time Hebrew TTS models with superior speed-accuracy trade-offs. We release our code, data, and models at https://phonikud.github.io.


Rezwan: Leveraging Large Language Models for Comprehensive Hadith Text Processing: A 1.2M Corpus Development

Asgari-Bidhendi, Majid, Ghaseminia, Muhammad Amin, Shahbazi, Alireza, Hossayni, Sayyed Ali, Torabian, Najmeh, Minaei-Bidgoli, Behrouz

arXiv.org Artificial Intelligence

This paper presents the development of Rezwan, a large-scale AI-assisted Hadith corpus comprising over 1.2M narrations, extracted and structured through a fully automated pipeline. Building on digital repositories such as Maktabat Ahl al-Bayt, the pipeline employs Large Language Models (LLMs) for segmentation, chain-text separation, validation, and multi-layer enrichment. Each narration is enhanced with machine translation into twelve languages, intelligent diacritization, abstractive summarization, thematic tagging, and cross-text semantic analysis. This multi-step process transforms raw text into a richly annotated, research-ready infrastructure for digital humanities and Islamic studies. A rigorous evaluation was conducted on 1,213 randomly sampled narrations, assessed by six domain experts. Results show near-human accuracy in structured tasks such as chain-text separation (9.33/10) and summarization (9.33/10), while highlighting ongoing challenges in diacritization and semantic similarity detection. Comparative analysis against the manually curated Noor Corpus demonstrates the superiority of Rezwan in both scale and quality, with a mean overall score of 8.46/10 versus 3.66/10. Furthermore, cost analysis confirms the economic feasibility of the AI approach: tasks requiring over 229,000 hours of expert labor were completed within months at a fraction of the cost. The work introduces a new paradigm in religious text processing by showing how AI can augment human expertise, enabling large-scale, multilingual, and semantically enriched access to Islamic heritage.


Unlocking the Potential of Arabic Voice-Generation Technologies

Communications of the ACM

Addressing linguistic complexities, the scarcity of high-quality datasets, and other challenges is crucial for advancing Arabic text-to-speech technology. Voice-generation technology enables machines to synthesize human-like speech (text-to-speech, or TTS), revolutionizing digital communication by fostering more inclusive and accessible experiences. What began as simple robotic speech synthesis has evolved into highly sophisticated voice-cloning systems that can produce natural, coherent, expressive, and personalized voices using minimal data. These technologies empower individuals with cross-lingual communication through virtual agents, assist in overcoming visual or speech impairments or literacy challenges via assistive tools, and support educators and industries such as entertainment with creative content generation.


Sadeed: Advancing Arabic Diacritization Through Small Language Model

Aldallal, Zeina, Chrouf, Sara, Hennara, Khalil, Hamed, Mohamed Motaism, Hreden, Muhammad, AlModhayan, Safwan

arXiv.org Artificial Intelligence

Arabic text diacritization remains a persistent challenge in natural language processing due to the language's morphological richness. In this paper, we introduce Sadeed, a novel approach based on a fine-tuned decoder-only language model adapted from Kuwain 1.5B (Hennara et al., 2025), a compact model originally trained on diverse Arabic corpora. Sadeed is fine-tuned on carefully curated, high-quality diacritized datasets, constructed through a rigorous data-cleaning and normalization pipeline. Despite utilizing modest computational resources, Sadeed achieves competitive results compared to proprietary large language models and outperforms traditional models trained on similar domains. Additionally, we highlight key limitations in current benchmarking practices for Arabic diacritization. To address these issues, we introduce SadeedDiac-25, a new benchmark designed to enable fairer and more comprehensive evaluation across diverse text genres and complexity levels. Together, Sadeed and SadeedDiac-25 provide a robust foundation for advancing Arabic NLP applications, including machine translation, text-to-speech, and language learning tools.


Proper Noun Diacritization for Arabic Wikipedia: A Benchmark Dataset

Bondok, Rawan, Nassar, Mayar, Khalifa, Salam, Micallef, Kurt, Habash, Nizar

arXiv.org Artificial Intelligence

Proper nouns in Arabic Wikipedia are frequently undiacritized, creating ambiguity in pronunciation and interpretation, especially for transliterated named entities of foreign origin. While transliteration and diacritization have been well-studied separately in Arabic NLP, their intersection remains underexplored. In this paper, we introduce a new manually diacritized dataset of Arabic proper nouns of various origins with their English Wikipedia equivalent glosses, and present the challenges and guidelines we followed to create it. We benchmark GPT-4o on the task of recovering full diacritization given the undiacritized Arabic and English forms, and analyze its performance. Achieving 73% accuracy, our results underscore both the difficulty of the task and the need for improved models and resources. We release our dataset to facilitate further research on Arabic Wikipedia proper noun diacritization.


Are LLMs Good Text Diacritizers? An Arabic and Yorùbá Case Study

Toyin, Hawau Olamide, Magdy, Samar M., Aldarmaki, Hanan

arXiv.org Artificial Intelligence

We investigate the effectiveness of large language models (LLMs) for text diacritization in two typologically distinct languages: Arabic and Yorùbá. To enable a rigorous evaluation, we introduce a novel multilingual dataset, MultiDiac, with diverse samples that capture a range of diacritic ambiguities. We evaluate 14 LLMs varying in size, accessibility, and language coverage, and benchmark them against 6 specialized diacritization models. Additionally, we fine-tune four small open-source models using LoRA for Yorùbá. Our results show that many off-the-shelf LLMs outperform specialized diacritization models for both Arabic and Yorùbá, but smaller models suffer from hallucinations. Fine-tuning on a small dataset can help improve diacritization performance and reduce hallucination rates.
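One simple way such hallucinations can be detected is an invariance check (our own illustrative sketch, not the paper's metric): a diacritizer should only insert marks, so stripping the output's diacritics must recover the source text exactly.

```python
# Sanity check for diacritizer hallucination: if removing the diacritics
# from the model output does not reproduce the undiacritized input, the
# model changed, added, or dropped base characters.

ARABIC_DIACRITICS = set("\u064B\u064C\u064D\u064E\u064F\u0650\u0651\u0652")

def strip_diacritics(text):
    """Remove Arabic short-vowel and related marks from `text`."""
    return "".join(ch for ch in text if ch not in ARABIC_DIACRITICS)

def hallucinated(source, output):
    """True if the output's base-character sequence differs from the source."""
    return strip_diacritics(output) != source

assert not hallucinated("كتب", "كَتَبَ")  # only marks were added: OK
assert hallucinated("كتب", "كَتَبْتُ")    # extra letters: the model rewrote the text
```

An analogous check for Yorùbá would strip combining tone marks via Unicode decomposition instead of a fixed mark set.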


YAD: Leveraging T5 for Improved Automatic Diacritization of Yorùbá Text

Olawole, Akindele Michael, Alabi, Jesujoba O., Sakpere, Aderonke Busayo, Adelani, David I.

arXiv.org Artificial Intelligence

In addition, we pre-train a text-to-text transformer (T5) model for Yorùbá and show that it outperforms several multilingually trained T5 models. Lastly, we show that more data and larger models are better at diacritization for Yorùbá.

Yorùbá, a language spoken predominantly in West Africa, is renowned for its tonal nature, characterized by a heavy use of diacritics to signify tone variations. In Yorùbá and many other languages, diacritics play a crucial role in disambiguating word meanings and in word pronunciation, making accurate diacritization essential for effective communication and language processing tasks (Skiredj & Berrada, 2024). However, manual diacritization is time-consuming and requires specialized linguistic expertise, motivating the development of automatic diacritization systems. In recent years, significant progress in natural language processing (NLP) has led to the exploration of various approaches to automating diacritization for languages that use diacritics (Náplava et al., 2018; Mubarak et al., 2019; Náplava et al., 2021; Stankevicius et al., 2022, inter alia), including Yorùbá (Orife, 2018; Orife et al., 2020).
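Because Yorùbá tone marks are combining characters, the undiacritized input side of a text-to-text training pair can be derived from diacritized text by Unicode decomposition. A minimal sketch (note that this also drops the under-dot of letters like ẹ/ọ, which a real pipeline might choose to keep):

```python
import unicodedata

def strip_yoruba_diacritics(text):
    """Derive the undiacritized side of a diacritization training pair by
    NFD-decomposing and dropping combining marks. Caveat: this removes
    under-dots (e.g. in ẹ/ọ) as well as tone marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# A (model input, target output) pair for a seq2seq diacritizer.
pair = (strip_yoruba_diacritics("Yorùbá"), "Yorùbá")
```

A T5-style model would then learn the mapping from the stripped input back to the fully marked target.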


CATT: Character-based Arabic Tashkeel Transformer

Alasmary, Faris, Zaafarani, Orjuwan, Ghannam, Ahmad

arXiv.org Artificial Intelligence

Tashkeel, or Arabic Text Diacritization (ATD), greatly enhances the comprehension of Arabic text by removing ambiguity and minimizing the risk of misinterpretations caused by its absence. It plays a crucial role in improving Arabic text processing, particularly in applications such as text-to-speech and machine translation. This paper introduces a new approach to training ATD models. First, we fine-tuned two transformers, encoder-only and encoder-decoder, that were initialized from a pretrained character-based BERT. Then, we applied the Noisy-Student approach to boost the performance of the best model. We evaluated our models alongside 11 commercial and open-source models using two manually labeled benchmark datasets: WikiNews and our CATT dataset. Our findings show that our top model surpasses all evaluated models, with relative Diacritic Error Rate (DER) improvements of 30.83% and 35.21% on WikiNews and CATT, respectively, achieving state-of-the-art in ATD. In addition, we show that our model outperforms GPT-4-turbo on the CATT dataset by a relative DER improvement of 9.36%. We open-source our CATT models and benchmark dataset for the research community (https://github.com/abjadai/catt).
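As a rough illustration of the metric, a minimal Diacritic Error Rate can be computed as the fraction of base letters whose diacritics differ from the reference. Published benchmarks differ in details (e.g., case-ending handling, letters with no mark), so this is a simplified sketch, not CATT's exact scoring.

```python
# Minimal Diacritic Error Rate (DER) sketch: pair each base letter with the
# diacritic string that follows it, then count mismatching pairs.

DIACRITICS = set("\u064B\u064C\u064D\u064E\u064F\u0650\u0651\u0652")

def split_letters(text):
    """Return (base letter, following diacritics) pairs for `text`."""
    pairs = []
    for ch in text:
        if ch in DIACRITICS and pairs:
            pairs[-1] = (pairs[-1][0], pairs[-1][1] + ch)
        else:
            pairs.append((ch, ""))
    return pairs

def der(reference, prediction):
    """Fraction of base letters whose diacritics differ from the reference."""
    ref, hyp = split_letters(reference), split_letters(prediction)
    assert [l for l, _ in ref] == [l for l, _ in hyp], "letter sequences differ"
    errors = sum(r != h for (_, r), (_, h) in zip(ref, hyp))
    return errors / len(ref)
```

A "relative DER improvement" as reported above is then `(der_baseline - der_model) / der_baseline`.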