Abugida Normalizer and Parser for Unicode texts
Nazmuddoha Ansary, Quazi Adibur Rahman Adib, Tahsin Reasat, Sazia Mehnaz, Asif Shahriyar Sushmit, Ahmed Imtiaz Humayun, Mohammad Mamun Or Rashid, Farig Sadeque
arXiv.org Artificial Intelligence
Unicode normalization is a procedure for transforming Unicode text into one of several defined equivalence forms, following rules outlined by the Unicode Standard [1]. The goal is to ensure that equivalent text is treated consistently across applications and systems. Graphemes are the basic units of writing: the individual letters, symbols, or glyphs that convey meaning within a language's writing system [2]. Each grapheme typically represents at least one phoneme, or sound component, of spoken language. Abugidas, also referred to as alphasyllabaries, are used by over 1.3 billion people, including many speakers of languages of India, Bangladesh, and Thailand. Despite this large user base, these languages face obstacles in natural language processing due to scarce resources and technological constraints. Nevertheless, there is broad academic and industrial interest in devising novel NLP techniques for these languages. Our group has developed a novel Indic Unicode normalizer designed to overcome typical problems encountered in online Indic abugida language datasets.
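The normalization and grapheme issues described above can be illustrated with Python's standard-library `unicodedata` module. This is a minimal sketch of the general Unicode behavior the abstract refers to, not the paper's own normalizer; the specific code points chosen here are illustrative examples.

```python
import unicodedata

# Two visually identical strings built from different code-point sequences.
composed = "\u00E9"        # "é" as a single precomposed code point
decomposed = "e\u0301"     # "e" followed by a combining acute accent

print(composed == decomposed)                                # False: raw sequences differ
print(unicodedata.normalize("NFC", decomposed) == composed)  # True: NFC composes
print(unicodedata.normalize("NFD", composed) == decomposed)  # True: NFD decomposes

# Indic scripts add a wrinkle: several precomposed letters are
# "composition exclusions", so even NFC leaves them decomposed.
# Bengali letter RRA (U+09DC) canonically decomposes to DDA (U+09A1) + nukta (U+09BC).
rra = "\u09DC"
print([hex(ord(c)) for c in unicodedata.normalize("NFC", rra)])  # ['0x9a1', '0x9bc']

# A single on-screen grapheme may also span several code points:
# the Bengali conjunct ক্ষ is KA + virama + SSA.
kss = "\u0995\u09CD\u09B7"
print(len(kss))  # 3 code points, rendered as one grapheme
```

Because visually identical text can be encoded in multiple ways, datasets scraped from the web often mix encodings for the same word, which is exactly the kind of inconsistency a normalizer must resolve.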
May 11, 2023