A Clustering Framework for Lexical Normalization of Roman Urdu
Khan, Abdul Rafae; Karim, Asim; Sajjad, Hassan; Kamiran, Faisal; Xu, Jia
arXiv.org Artificial Intelligence
Roman Urdu is an informal form of the Urdu language written in Roman script and is widely used in South Asia for online textual content. It lacks standard spelling and hence poses several normalization challenges during automatic language processing. In this article, we present a feature-based clustering framework for the lexical normalization of Roman Urdu corpora. The framework includes a phonetic algorithm (UrduPhone), a string matching component, a feature-based similarity function, and a clustering algorithm (Lex-Var). UrduPhone encodes Roman Urdu strings to their pronunciation-based representations. The string matching component handles character-level variations that occur when writing Urdu using Roman script.
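The pipeline described in the abstract (phonetic encoding, string matching, a combined similarity score, then clustering) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the consonant groups are a simplified Soundex-style table and not the UrduPhone mapping, `difflib.SequenceMatcher` stands in for the string matching component, the equal weighting and the 0.75 threshold are arbitrary, and the greedy first-match grouping is not Lex-Var. The names `phonetic_key`, `similarity`, and `cluster` are hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical consonant groups, loosely Soundex-style.
# This is NOT the UrduPhone encoding table from the paper.
_GROUPS = {
    "b": "1", "p": "1", "f": "1", "v": "1", "w": "1",
    "c": "2", "k": "2", "q": "2", "g": "2", "j": "2",
    "s": "2", "x": "2", "z": "2",
    "d": "3", "t": "3",
    "l": "4",
    "m": "5", "n": "5",
    "r": "6",
}


def phonetic_key(word, length=6):
    """Keep the first letter, then encode the remaining consonants by group.

    Vowels and unlisted letters are dropped; adjacent letters from the
    same group collapse into one code. The key is padded with zeros.
    """
    word = word.lower()
    if not word:
        return ""
    key = word[0]
    prev = _GROUPS.get(word[0], "")
    for ch in word[1:]:
        code = _GROUPS.get(ch, "")
        if code and code != prev:
            key += code
        prev = code
    return key[:length].ljust(length, "0")


def similarity(a, b, w_phon=0.5):
    """Weighted mix of phonetic-key agreement and character-level similarity."""
    phon = 1.0 if phonetic_key(a) == phonetic_key(b) else 0.0
    string = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return w_phon * phon + (1.0 - w_phon) * string


def cluster(words, threshold=0.75):
    """Greedy grouping: join the first cluster whose first member is similar enough."""
    clusters = []
    for w in words:
        for c in clusters:
            if similarity(w, c[0]) >= threshold:
                c.append(w)
                break
        else:
            # No existing cluster matched, so start a new one.
            clusters.append([w])
    return clusters


variants = ["zindagi", "zindagee", "zindgi", "kitab", "kitaab"]
groups = cluster(variants)
```

With these toy inputs, `groups` holds two clusters: the three spellings of "zindagi" and the two spellings of "kitab". Comparing each word only against a cluster's first member keeps the sketch short, at the cost of making the result depend on input order.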
Mar-31-2020
- Country:
- Africa > Middle East
- Egypt > Giza Governorate > Giza (0.05)
- Asia
- China > Beijing
- Beijing (0.04)
- India
- Karnataka > Bengaluru (0.04)
- West Bengal > Kharagpur (0.04)
- Japan > Honshū
- Kansai > Osaka Prefecture > Osaka (0.04)
- Middle East > Qatar
- Pakistan > Punjab
- Lahore Division > Lahore (0.04)
- South Korea (0.04)
- Thailand > Chiang Mai
- Chiang Mai (0.04)
- Europe
- United Kingdom
- England > Cambridgeshire
- Cambridge (0.04)
- Scotland > City of Edinburgh
- Edinburgh (0.04)
- Sweden
- Stockholm > Stockholm (0.04)
- Uppsala County > Uppsala (0.04)
- Germany > Hesse
- Darmstadt Region > Darmstadt (0.04)
- Czechia > South Moravian Region
- Brno (0.04)
- Portugal > Lisbon
- Lisbon (0.04)
- Greece > Attica
- Athens (0.04)
- Bulgaria > Sofia City Province
- Sofia (0.04)
- Spain > Andalusia
- Granada Province > Granada (0.04)
- Poland > Greater Poland Province
- Poznań (0.04)
- North America
- Canada > Quebec
- Montreal (0.04)
- United States
- New Jersey > Hudson County
- Hoboken (0.04)
- California
- Los Angeles County > Los Angeles (0.14)
- San Francisco County > San Francisco (0.14)
- Washington > King County
- Seattle (0.04)
- Georgia > Fulton County
- Atlanta (0.04)
- Oregon > Multnomah County
- Portland (0.04)
- Maryland
- Baltimore (0.04)
- Howard County > Columbia (0.04)
- Montgomery County > Gaithersburg (0.04)
- New York
- Monroe County > Rochester (0.04)
- New York County > New York City (0.04)
- Colorado > Denver County
- Denver (0.04)
- Texas > Dallas County
- Dallas (0.04)
- South America > Brazil
- Paraná (0.04)
- Genre:
- Research Report (1.00)
- Industry:
- Information Technology (0.46)
- Technology: