AITopics | arabic script

Collaborating Authors

arabic script

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Muharaf: Manuscripts of Handwritten Arabic Dataset for Cursive Text Recognition

Neural Information Processing SystemsFeb-15-2026, 15:14:33 GMT

We present the Manuscripts of Handwritten Arabic (Muharaf) dataset, which is a machine learning dataset consisting of more than 1,600 historic handwritten page images transcribed by experts in archival Arabic.

machine learning, natural language, pattern recognition, (17 more...)

Neural Information Processing Systems

Country:

North America > United States > North Carolina (0.04)
North America > Canada > Quebec > Montreal (0.04)
Asia > Middle East > Iran (0.04)
(6 more...)

Genre: Research Report (0.46)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition (0.83)

Add feedback

A Rhythm-Aware Phrase Insertion for Classical Arabic Poetry Composition

Elzohbi, Mohamad, Zhao, Richard

arXiv.org Artificial IntelligenceDec-9-2025

This paper presents a methodology for inserting phrases in Arabic poems to conform to a specific rhythm using ByT5, a byte-level multilingual transformer-based model. Our work discusses a rule-based grapheme-to-beat transformation tailored for extracting the rhythm from fully diacritized Arabic script. Our approach employs a conditional denoising objective to fine-tune ByT5, where the model reconstructs masked words to match a target rhythm. We adopt a curriculum learning strategy, pre-training on a general Arabic dataset before fine-tuning on poetic dataset, and explore cross-lingual transfer from English to Arabic. Experimental results demonstrate that our models achieve high rhythmic alignment while maintaining semantic coherence. The proposed model has the potential to be used in co-creative applications in the process of composing classical Arabic poems.

artificial intelligence, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

doi: 10.18653/v1/2025.arabicnlp-main.15

2509.18514

Country: North America > Canada > Alberta > Census Division No. 6 > Calgary Metropolitan Region > Calgary (0.15)

Genre:

Research Report > New Finding (0.34)
Overview > Innovation (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

How Well Do LLMs Understand Tunisian Arabic?

Mahdi, Mohamed

arXiv.org Artificial IntelligenceNov-24-2025

Large Language Models (LLMs) are the engines driving today's AI agents. The better these models understand human languages, the more natural and user-friendly the interaction with AI becomes, from everyday devices like computers and smartwatches to any tool that can act intelligently. Yet, the ability of industrial-scale LLMs to comprehend low-resource languages, such as Tunisian Arabic (Tunizi), is often overlooked. This neglect risks excluding millions of Tunisians from fully interacting with AI in their own language, pushing them toward French or English. Such a shift not only threatens the preservation of the Tunisian dialect but may also create challenges for literacy and influence younger generations to favor foreign languages. In this study, we introduce a novel dataset containing parallel Tunizi, standard Tunisian Arabic, and English translations, along with sentiment labels. We benchmark several popular LLMs on three tasks: transliteration, translation, and sentiment analysis. Our results reveal significant differences between models, highlighting both their strengths and limitations in understanding and processing Tunisian dialects. By quantifying these gaps, this work underscores the importance of including low-resource languages in the next generation of AI systems, ensuring technology remains accessible, inclusive, and culturally grounded.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2511.16683

Country: Africa > Middle East > Tunisia (0.15)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.51)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.46)

Add feedback

6b8cb6b291045e217c3ff3f854f2fd0f-Paper-Datasets_and_Benchmarks_Track.pdf

Neural Information Processing SystemsOct-10-2025, 05:09:34 GMT

dataset, muharaf dataset, transcription, (13 more...)

Neural Information Processing Systems

Country:

North America > United States > North Carolina (0.04)
North America > Canada > Quebec > Montreal (0.04)
Asia > Middle East > Iran (0.04)
(6 more...)

Genre: Research Report (0.46)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Sensing and Signal Processing > Image Processing (0.68)

Add feedback

Filling the Gap for Uzbek: Creating Translation Resources for Southern Uzbek

Mamasaidov, Mukhammadsaid, Aral, Azizullah, Shopulatov, Abror, Inomjonov, Mironshoh

arXiv.org Artificial IntelligenceAug-21-2025

Southern Uzbek (uzs) is a Turkic language variety spoken by around 5 million people in Afghanistan and differs significantly from Northern Uzbek (uzn) in phonology, lexicon, and orthography. Despite the large number of speakers, Southern Uzbek is underrepresented in natural language processing. We present new resources for Southern Uzbek machine translation, including a 997-sentence FLORES+ dev set, 39,994 parallel sentences from dictionary, literary, and web sources, and a fine-tuned NLLB-200 model (lutfiy). We also propose a post-processing method for restoring Arabic-script half-space characters, which improves handling of morphological boundaries. All datasets, models, and tools are released publicly to support future work on Southern Uzbek and other low-resource languages.

artificial intelligence, natural language, southern uzbek, (16 more...)

arXiv.org Artificial Intelligence

2508.14586

Country:

North America > United States (0.94)
Asia > Afghanistan (0.70)

Genre: Research Report (0.40)

Industry: Government > Regional Government > North America Government > United States Government (0.46)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Add feedback

Nile-Chat: Egyptian Language Models for Arabic and Latin Scripts

Shang, Guokan, Abdine, Hadi, Chamma, Ahmad, Mohamed, Amr, Anwar, Mohamed, Bounhar, Abdelaziz, Herraoui, Omar El, Nakov, Preslav, Vazirgiannis, Michalis, Xing, Eric

arXiv.org Artificial IntelligenceJul-8-2025

We introduce Nile-Chat-4B, 3x4B-A6B, and 12B, a collection of LLMs for Egyptian dialect, uniquely designed to understand and generate texts written in both Arabic and Latin scripts. Specifically, with Nile-Chat-3x4B-A6B, we introduce a novel language adaptation approach by leveraging the Branch-Train-MiX strategy to merge script-specialized experts, into a single MoE model. Our Nile-Chat models significantly outperform leading multilingual and Arabic LLMs, such as LLaMa, Jais, and ALLaM, on our newly introduced Egyptian evaluation benchmarks, which span both understanding and generative tasks. Notably, our 12B model yields a 14.4% performance gain over Qwen2.5-14B-Instruct on Latin-script benchmarks. All our resources are publicly available. We believe this work presents a comprehensive methodology for adapting LLMs to dual-script languages, addressing an often overlooked aspect in modern LLM development.

arxiv preprint arxiv, large language model, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2507.04569

Country:

Africa > Middle East > Egypt (0.14)
Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
(7 more...)

Genre: Research Report (0.82)

Industry: Education > Educational Setting (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Add feedback

QARI-OCR: High-Fidelity Arabic Text Recognition through Multimodal Large Language Model Adaptation

Wasfy, Ahmed, Nacar, Omer, Elkhateb, Abdelakreem, Reda, Mahmoud, Elshehy, Omar, Ammar, Adel, Boulila, Wadii

arXiv.org Artificial IntelligenceJun-4-2025

The inherent complexities of Arabic script; its cursive nature, diacritical marks (tashkeel), and varied typography, pose persistent challenges for Optical Character Recognition (OCR). We present Qari-OCR, a series of vision-language models derived from Qwen2-VL-2B-Instruct, progressively optimized for Arabic through iterative fine-tuning on specialized synthetic datasets. Our leading model, QARI v0.2, establishes a new open-source state-of-the-art with a Word Error Rate (WER) of 0.160, Character Error Rate (CER) of 0.061, and BLEU score of 0.737 on diacritically-rich texts. Qari-OCR demonstrates superior handling of tashkeel, diverse fonts, and document layouts, alongside impressive performance on low-resolution images. Further explorations (QARI v0.3) showcase strong potential for structural document understanding and handwritten text. This work delivers a marked improvement in Arabic OCR accuracy and efficiency, with all models and datasets released to foster further research.

machine learning, pattern recognition, recognition, (19 more...)

arXiv.org Artificial Intelligence

2506.02295

Country: Asia (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition > Text Recognition (0.42)

Add feedback

HATFormer: Historic Handwritten Arabic Text Recognition with Transformers

Chan, Adrian, Mijar, Anupam, Saeed, Mehreen, Wong, Chau-Wai, Khater, Akram

arXiv.org Artificial IntelligenceOct-2-2024

Arabic handwritten text recognition (HTR) is challenging, especially for historical texts, due to diverse writing styles and the intrinsic features of Arabic script. Additionally, Arabic handwriting datasets are smaller compared to English ones, making it difficult to train generalizable Arabic HTR models. To address these challenges, we propose HATFormer, a transformer-based encoder-decoder architecture that builds on a state-of-the-art English HTR model. By leveraging the transformer's attention mechanism, HATFormer captures spatial contextual information to address the intrinsic challenges of Arabic script through differentiating cursive characters, decomposing visual representations, and identifying diacritics. Our customization to historical handwritten Arabic includes an image processor for effective ViT information preprocessing, a text tokenizer for compact Arabic text representation, and a training pipeline that accounts for a limited amount of historic Arabic handwriting data. HATFormer achieves a character error rate (CER) of 8.6% on the largest public historical handwritten Arabic dataset, with a 51% improvement over the best baseline in the literature. HATFormer also attains a comparable CER of 4.2% on the largest private non-historical dataset. Our work demonstrates the feasibility of adapting an English HTR method to a low-resource language with complex, language-specific challenges, contributing to advancements in document digitization, information retrieval, and cultural preservation.

dataset, international conference, recognition, (14 more...)

arXiv.org Artificial Intelligence

2410.02179

Country:

South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
North America > United States (0.04)
Europe > Middle East (0.04)
(3 more...)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Vision > Handwriting Recognition (0.88)

Add feedback

Qalam : A Multimodal LLM for Arabic Optical Character and Handwriting Recognition

Bhatia, Gagan, Nagoudi, El Moatez Billah, Alwajih, Fakhraddin, Abdul-Mageed, Muhammad

arXiv.org Artificial IntelligenceJul-18-2024

Arabic Optical Character Recognition (OCR) and Handwriting Recognition (HWR) pose unique challenges due to the cursive and context-sensitive nature of the Arabic script. This study introduces Qalam, a novel foundation model designed for Arabic OCR and HWR, built on a SwinV2 encoder and RoBERTa decoder architecture. Our model significantly outperforms existing methods, achieving a Word Error Rate (WER) of just 0.80% in HWR tasks and 1.18% in OCR tasks. We train Qalam on a diverse dataset, including over 4.5 million images from Arabic manuscripts and a synthetic dataset comprising 60k image-text pairs. Notably, Qalam demonstrates exceptional handling of Arabic diacritics, a critical feature in Arabic scripts. Furthermore, it shows a remarkable ability to process high-resolution inputs, addressing a common limitation in current OCR systems. These advancements underscore Qalam's potential as a leading solution for Arabic script recognition, offering a significant leap in accuracy and efficiency.

dataset, qalam, recognition, (12 more...)

arXiv.org Artificial Intelligence

2407.13559

Country:

North America > United States (0.05)
South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
North America > Canada > British Columbia (0.04)
(4 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (1.00)
Information Technology > Artificial Intelligence > Vision > Handwriting Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(2 more...)

Add feedback

Enhancing Language Learning through Technology: Introducing a New English-Azerbaijani (Arabic Script) Parallel Corpus

Khiarak, Jalil Nourmohammadi, Ahmadi, Ammar, Saeed, Taher Ak-bari, Asgari-Chenaghlu, Meysam, Atabay, Toğrul, Karimi, Mohammad Reza Baghban, Ceferli, Ismail, Hasanvand, Farzad, Mousavi, Seyed Mahboub, Noshad, Morteza

arXiv.org Artificial IntelligenceJul-6-2024

This paper introduces a pioneering English-Azerbaijani (Arabic Script) parallel corpus, designed to bridge the technological gap in language learning and machine translation (MT) for under-resourced languages. Consisting of 548,000 parallel sentences and approximately 9 million words per language, this dataset is derived from diverse sources such as news articles and holy texts, aiming to enhance natural language processing (NLP) applications and language education technology. This corpus marks a significant step forward in the realm of linguistic resources, particularly for Turkic languages, which have lagged in the neural machine translation (NMT) revolution. By presenting the first comprehensive case study for the English-Azerbaijani (Arabic Script) language pair, this work underscores the transformative potential of NMT in low-resource contexts. The development and utilization of this corpus not only facilitate the advancement of machine translation systems tailored for specific linguistic needs but also promote inclusive language learning through technology. The findings demonstrate the corpus's effectiveness in training deep learning MT systems and underscore its role as an essential asset for researchers and educators aiming to foster bilingual education and multilingual communication. This research covers the way for future explorations into NMT applications for languages lacking substantial digital resources, thereby enhancing global language education frameworks. The Python package of our code is available at https://pypi.org/project/chevir-kartalol/, and we also have a website accessible at https://translate.kartalol.com/.

arabic script, azerbaijani, translation, (14 more...)

arXiv.org Artificial Intelligence

2407.05189

Country:

Asia > Azerbaijan (0.05)
Asia > Middle East > Iran > Ardabil Province > Ardabil (0.05)
North America > United States > Michigan (0.04)
(8 more...)

Genre: Research Report > New Finding (0.88)

Industry: Education > Curriculum > Subject-Specific Education (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback