Abugida Normalizer and Parser for Unicode texts
Nazmuddoha Ansary, Quazi Adibur Rahman Adib, Tahsin Reasat, Sazia Mehnaz, Asif Shahriyar Sushmit, Ahmed Imtiaz Humayun, Mohammad Mamun Or Rashid, Farig Sadeque
arXiv.org Artificial Intelligence
Unicode normalization is a procedure for transforming Unicode text into one of several defined equivalence forms, following rules outlined by the Unicode Standard [1]. The goal is to ensure that equivalent text is treated consistently across applications and systems. Graphemes are the basic units of writing: the individual letters, symbols, or glyphs that convey meaning within a language's writing system [2]. Each grapheme typically represents at least one phoneme, or sound component, of spoken language. Abugidas, also referred to as alphasyllabaries, are used by over 1.3 billion people, including many speakers of languages of India, Bangladesh, and Thailand. Despite this large user base, these languages face obstacles in natural language processing due to scarce resources and technological constraints. Nevertheless, there is broad academic and industrial interest in devising novel NLP techniques for these languages. Our group has developed a novel Indic Unicode normalizer designed to overcome typical problems encountered in online Indic abugida language datasets.
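The normalization and grapheme issues described above can be illustrated with Python's standard-library `unicodedata` module. This is a minimal sketch of the general Unicode behavior the abstract refers to, not the paper's own normalizer; the specific code points chosen here are illustrative examples.

```python
import unicodedata

# Two visually identical strings built from different code-point sequences.
composed = "\u00E9"        # "é" as a single precomposed code point
decomposed = "e\u0301"     # "e" followed by a combining acute accent

print(composed == decomposed)                                # False: raw sequences differ
print(unicodedata.normalize("NFC", decomposed) == composed)  # True: NFC composes
print(unicodedata.normalize("NFD", composed) == decomposed)  # True: NFD decomposes

# Indic scripts add a wrinkle: several precomposed letters are
# "composition exclusions", so even NFC leaves them decomposed.
# Bengali letter RRA (U+09DC) canonically decomposes to DDA (U+09A1) + nukta (U+09BC).
rra = "\u09DC"
print([hex(ord(c)) for c in unicodedata.normalize("NFC", rra)])  # ['0x9a1', '0x9bc']

# A single on-screen grapheme may also span several code points:
# the Bengali conjunct ক্ষ is KA + virama + SSA.
kss = "\u0995\u09CD\u09B7"
print(len(kss))  # 3 code points, rendered as one grapheme
```

Because visually identical text can be encoded in multiple ways, datasets scraped from the web often mix encodings for the same word, which is exactly the kind of inconsistency a normalizer must resolve.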
May 11, 2023