Graphemic Normalization of the Perso-Arabic Script
Doctor, Raiomond, Gutkin, Alexander, Johny, Cibu, Roark, Brian, Sproat, Richard
arXiv.org Artificial Intelligence
Since its original appearance in 1991, the Perso-Arabic script representation in Unicode has grown from 169 to over 440 atomic isolated characters spread over several code pages representing standard letters, various diacritics and punctuation for the original Arabic and numerous other regional orthographic traditions. This paper documents the challenges that Perso-Arabic presents beyond the best-documented languages, such as Arabic and Persian, building on earlier work by the expert community. We particularly focus on the situation in natural language processing (NLP), which is affected by multiple, often neglected, issues such as the use of visually ambiguous yet canonically nonequivalent letters and the mixing of letters from different orthographies. Among the contributing conflating factors are the lack of input methods, the instability of modern orthographies, insufficient literacy, and loss or lack of orthographic tradition. We evaluate the effects of script normalization on eight languages from diverse language families in the Perso-Arabic script diaspora on machine translation and statistical language modeling tasks. Our results indicate statistically significant improvements in performance in most conditions for all the languages considered when normalization is applied. We argue that better understanding and representation of Perso-Arabic script variation within regional orthographic traditions, where those are present, is crucial for further progress of modern computational NLP techniques especially for languages with a paucity of resources.
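The phrase "visually ambiguous yet canonically nonequivalent letters" can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes Persian as the target language, and the two-entry table `PERSIAN_VISUAL_MAP` and the helper `normalize_persian` are hypothetical names introduced only for this example. The sketch shows that standard Unicode normalization (NFC) merges canonically equivalent sequences but leaves look-alike letters such as ARABIC LETTER KAF (U+0643) and ARABIC LETTER KEHEH (U+06A9) distinct, so a language-specific mapping is needed on top of it.

```python
import unicodedata

# Hypothetical, illustrative mapping for Persian text only (an assumption made
# for this sketch, not a table taken from the paper): Arabic letters that can
# look like their Persian counterparts but are distinct code points.
PERSIAN_VISUAL_MAP = {
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
}


def normalize_persian(text: str) -> str:
    """Apply NFC, then the illustrative language-specific letter mapping."""
    text = unicodedata.normalize("NFC", text)
    return text.translate(str.maketrans(PERSIAN_VISUAL_MAP))


# Canonically equivalent: ALEF + MADDAH ABOVE composes to ALEF WITH MADDA ABOVE.
composed = unicodedata.normalize("NFC", "\u0627\u0653")

# Canonically nonequivalent: NFC leaves Arabic KAF untouched.
kaf_after_nfc = unicodedata.normalize("NFC", "\u0643")

# The word "kitab" typed with Arabic KAF, mapped to the Persian KEHEH form.
mapped = normalize_persian("\u0643\u062A\u0627\u0628")
```

A mapping of this kind has to be chosen per language: the same Arabic letters are the correct ones in Arabic text, which is why the example restricts itself to a single assumed orthography.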
Oct-31-2022
- Country:
- Oceania (0.04)
- North America
- Dominican Republic (0.04)
- United States
- New York (0.04)
- District of Columbia > Washington (0.04)
- Pennsylvania > Philadelphia County
- Philadelphia (0.04)
- Michigan > Washtenaw County
- Ann Arbor (0.04)
- Massachusetts > Middlesex County
- Cambridge (0.04)
- Illinois
- Cook County > Chicago (0.04)
- Champaign County > Urbana (0.04)
- Georgia > Fulton County
- Atlanta (0.04)
- California
- San Diego County > San Diego (0.04)
- Santa Clara County > Mountain View (0.04)
- Canada
- Quebec > Montreal (0.04)
- British Columbia > Metro Vancouver Regional District
- Vancouver (0.04)
- Europe
- Eastern Europe (0.04)
- Poland > Greater Poland Province
- Poznań (0.04)
- Iceland > Capital Region
- Reykjavik (0.04)
- Germany
- Berlin (0.04)
- Bavaria > Upper Bavaria
- Munich (0.04)
- France > Provence-Alpes-Côte d'Azur
- Bouches-du-Rhône > Marseille (0.04)
- Bulgaria > Sofia City Province
- Sofia (0.04)
- Portugal > Lisbon
- Lisbon (0.04)
- Spain > Catalonia
- Barcelona Province > Barcelona (0.04)
- Russia > North Caucasian Federal District
- Republic of Dagestan (0.04)
- Netherlands > South Holland
- Leiden (0.04)
- Finland > Uusimaa
- Helsinki (0.04)
- Middle East > Republic of Türkiye
- Istanbul Province > Istanbul (0.04)
- Ireland > Leinster
- County Dublin > Dublin (0.04)
- United Kingdom
- Scotland > City of Edinburgh
- Edinburgh (0.04)
- England
- Oxfordshire > Oxford (0.14)
- Cambridgeshire > Cambridge (0.14)
- West Sussex (0.04)
- Greater London > London (0.04)
- Belgium > Brussels-Capital Region
- Brussels (0.04)
- Asia
- Azerbaijan (0.04)
- Southeast Asia (0.04)
- East Asia (0.04)
- Kazakhstan (0.04)
- Brunei (0.04)
- Afghanistan (0.04)
- Thailand (0.04)
- Turkmenistan (0.04)
- Kyrgyzstan (0.04)
- Russia (0.04)
- Uzbekistan (0.04)
- Indonesia > Sumatra (0.04)
- Nepal (0.04)
- India
- Tamil Nadu > Chennai (0.04)
- Punjab (0.04)
- NCT > Delhi (0.04)
- Jammu and Kashmir > Srinagar (0.04)
- Himachal Pradesh (0.04)
- Chhattisgarh > Raipur (0.04)
- China
- Xinjiang Uygur Autonomous Region (0.14)
- Hong Kong (0.04)
- Middle East
- Iran (0.04)
- Syria (0.04)
- UAE > Abu Dhabi Emirate
- Abu Dhabi (0.04)
- Republic of Türkiye
- Ordu Province > Ordu (0.04)
- Istanbul Province > Istanbul (0.04)
- Israel > Jerusalem District
- Jerusalem (0.04)
- Iraq > Kurdistan Region
- Duhok Governorate > Duhok (0.04)
- Malaysia > Kuala Lumpur
- Kuala Lumpur (0.04)
- Japan > Honshū
- Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
- Pakistan
- Islamabad Capital Territory > Islamabad (0.04)
- Punjab > Lahore Division
- Lahore (0.04)
- Africa
- Southern Africa (0.04)
- Niger (0.04)
- Middle East > Morocco (0.04)
- Madagascar (0.04)
- Genre:
- Overview (1.00)
- Research Report
- New Finding (1.00)
- Experimental Study (0.94)
- Industry:
- Government (0.67)
- Information Technology > Security & Privacy (0.46)
- Technology: