A Clustering Framework for Lexical Normalization of Roman Urdu
Khan, Abdul Rafae, Karim, Asim, Sajjad, Hassan, Kamiran, Faisal, Xu, Jia
arXiv.org Artificial Intelligence
Roman Urdu is an informal form of the Urdu language written in Roman script, which is widely used in South Asia for online textual content. It lacks standard spelling and hence poses several normalization challenges during automatic language processing. In this article, we present a feature-based clustering framework for the lexical normalization of Roman Urdu corpora, which includes a phonetic algorithm UrduPhone, a string matching component, a feature-based similarity function, and a clustering algorithm Lex-Var. UrduPhone encodes Roman Urdu strings to their pronunciation-based representations. The string matching component handles character-level variations that occur when writing Urdu using Roman script.
Mar-31-2020
- Country:
- South America > Brazil
- Paraná (0.04)
- North America
- United States
- Texas > Dallas County
- Dallas (0.04)
- Colorado > Denver County
- Denver (0.04)
- New York
- New York County > New York City (0.04)
- Monroe County > Rochester (0.04)
- Maryland
- Baltimore (0.04)
- Montgomery County > Gaithersburg (0.04)
- Howard County > Columbia (0.04)
- Oregon > Multnomah County
- Portland (0.04)
- Georgia > Fulton County
- Atlanta (0.04)
- Washington > King County
- Seattle (0.04)
- California
- San Francisco County > San Francisco (0.14)
- Los Angeles County > Los Angeles (0.14)
- New Jersey > Hudson County
- Hoboken (0.04)
- Canada > Quebec
- Montreal (0.04)
- Europe
- Poland > Greater Poland Province
- Poznań (0.04)
- Spain > Andalusia
- Granada Province > Granada (0.04)
- Bulgaria > Sofia City Province
- Sofia (0.04)
- Greece > Attica
- Athens (0.04)
- Portugal > Lisbon
- Lisbon (0.04)
- Czechia > South Moravian Region
- Brno (0.04)
- Germany > Hesse
- Darmstadt Region > Darmstadt (0.04)
- Sweden
- Uppsala County > Uppsala (0.04)
- Stockholm > Stockholm (0.04)
- United Kingdom
- Scotland > City of Edinburgh
- Edinburgh (0.04)
- England > Cambridgeshire
- Cambridge (0.04)
- Asia
- South Korea (0.04)
- Thailand > Chiang Mai
- Chiang Mai (0.04)
- Pakistan > Punjab
- Lahore Division > Lahore (0.04)
- Middle East > Qatar
- Japan > Honshū
- Kansai > Osaka Prefecture > Osaka (0.04)
- India
- West Bengal > Kharagpur (0.04)
- Karnataka > Bengaluru (0.04)
- China > Beijing
- Beijing (0.04)
- Africa > Middle East
- Egypt > Giza Governorate > Giza (0.05)
- Genre:
- Research Report (1.00)
- Industry:
- Information Technology (0.46)
- Technology: