Proper Noun Diacritization for Arabic Wikipedia: A Benchmark Dataset
Bondok, Rawan; Nassar, Mayar; Khalifa, Salam; Micallef, Kurt; Habash, Nizar
arXiv.org Artificial Intelligence
Proper nouns in Arabic Wikipedia are frequently undiacritized, creating ambiguity in pronunciation and interpretation, especially for transliterated named entities of foreign origin. While transliteration and diacritization have each been well studied in Arabic NLP, their intersection remains underexplored. In this paper, we introduce a new manually diacritized dataset of Arabic proper nouns of various origins, paired with their English Wikipedia equivalent glosses, and present the challenges we faced and the guidelines we followed in creating it. We benchmark GPT-4o on the task of recovering full diacritization given the undiacritized Arabic and English forms, and analyze its performance. The model achieves 73% accuracy, a result that underscores both the difficulty of the task and the need for improved models and resources. We release our dataset to facilitate further research on Arabic Wikipedia proper noun diacritization.
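To make the task setup concrete, the following is a minimal sketch of how an undiacritized input relates to its diacritized reference and how an exact-match accuracy could be computed. The abstract does not define the metric precisely, so word-level exact match is an assumption here, and the names `strip_diacritics` and `exact_match_accuracy` are hypothetical helpers, not part of the released dataset or the authors' evaluation code.

```python
import unicodedata

# Arabic diacritics: U+064B (fathatan) through U+0652 (sukun),
# plus U+0670 (dagger alif). This set is an illustrative assumption.
ARABIC_DIACRITICS = frozenset(chr(cp) for cp in range(0x064B, 0x0653)) | {"\u0670"}


def strip_diacritics(text: str) -> str:
    """Return text with Arabic diacritic marks removed (the model's input form)."""
    return "".join(ch for ch in text if ch not in ARABIC_DIACRITICS)


def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions identical to their reference after NFC normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(
        unicodedata.normalize("NFC", p) == unicodedata.normalize("NFC", r)
        for p, r in zip(predictions, references)
    )
    return hits / len(references)
```

NFC normalization puts stacked marks such as shadda and fatha into canonical order, so the comparison does not depend on the order in which the marks were typed.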
Jun-24-2025
- Country:
  - Africa
    - Middle East > Egypt
      - Cairo Governorate > Cairo (0.04)
    - Sudan (0.04)
  - Asia
    - China
      - Hong Kong (0.04)
      - Tibet Autonomous Region (0.04)
    - Middle East
      - Oman (0.04)
      - UAE > Abu Dhabi Emirate
        - Abu Dhabi (0.14)
    - Thailand > Bangkok
      - Bangkok (0.04)
  - Europe
    - France > Provence-Alpes-Côte d'Azur
      - Bouches-du-Rhône > Marseille (0.04)
    - Middle East > Malta (0.04)
    - Netherlands (0.04)
    - Spain (0.04)
    - Ukraine > Kyiv Oblast
      - Kyiv (0.04)
    - United Kingdom > Scotland
      - City of Edinburgh > Edinburgh (0.04)
  - North America > United States
    - Michigan > Washtenaw County
      - Ann Arbor (0.04)
    - Minnesota > Hennepin County
      - Minneapolis (0.14)
    - New York > Suffolk County
      - Stony Brook (0.04)
    - Ohio > Franklin County
      - Columbus (0.04)
  - Oceania > Australia
    - New South Wales > Sydney (0.04)
    - Victoria > Melbourne (0.04)
  - South America > Uruguay (0.04)
- Genre:
  - Research Report > New Finding (0.68)
- Technology: