The Development of a Labelled te reo Māori-English Bilingual Database for Language Technology
Jesin James, Isabella Shields, Vithya Yogarajan, Peter J. Keegan, Catherine Watson, Peter-Lucas Jones, Keoni Mahelona
arXiv.org Artificial Intelligence
Te reo Māori (referred to as Māori), New Zealand's indigenous language, is under-resourced in language technology. Māori speakers are bilingual, and Māori is frequently code-switched with English. However, few resources exist for Māori language technology, language detection, or code-switch detection for the Māori-English pair. Both English and Māori use Roman-derived orthography, which limits what rule-based systems can achieve in detecting language and code-switching. Most Māori language detection is done manually by language experts. This research builds a Māori-English bilingual database of 66,016,807 words with word-level language annotation. The database was built from the New Zealand Parliament Hansard debate reports. Language labels are assigned using language-specific rules and expert manual annotations. Some words are spelled identically in Māori and English but have different meanings; these could not be categorised as Māori or English using word-level language rules, so manual annotation was necessary. An analysis of the database is also reported, covering metadata, a year-wise breakdown, frequently occurring words, sentence length and N-grams. The database developed here is a valuable tool for future language and speech technology development for Aotearoa New Zealand. The labelling methodology can also be applied to other low-resourced language pairs.
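To illustrate the kind of word-level rule the abstract refers to, here is a minimal Python sketch. It is not the authors' rule set. It assumes only the standard Māori orthography (vowels a, e, i, o, u with optional macrons; consonants h, k, m, n, p, r, t, w and the digraphs ng and wh; every syllable an optional consonant followed by vowels). The names `label_word`, `label_sentence` and the tiny `ENGLISH_LOOKALIKES` set are hypothetical and exist only for illustration.

```python
import re
import unicodedata

# Assumption: a word is orthographically Māori if it is a sequence of
# syllables, each an optional consonant (h k m n p r t w, ng, wh)
# followed by one or more vowels (with or without macrons).
MAORI_WORD = re.compile(r"(?:(?:ng|wh|[hkmnprtw])?[aeiouāēīōū]+)+")

# Hypothetical, deliberately tiny list of English words that also fit
# the Māori pattern. A real list would need expert curation.
ENGLISH_LOOKALIKES = {"a", "i", "he", "me", "we", "to", "no", "one", "more", "take", "here"}


def label_word(word):
    """Return 'Māori', 'English' or 'Ambiguous' for a single word."""
    # Normalise so that a vowel plus combining macron becomes one character.
    w = unicodedata.normalize("NFC", word).lower()
    if not MAORI_WORD.fullmatch(w):
        return "English"
    if w in ENGLISH_LOOKALIKES:
        return "Ambiguous"  # spelling alone cannot decide; needs a human
    return "Māori"


def label_sentence(sentence):
    """Label every alphabetic token in a sentence."""
    text = unicodedata.normalize("NFC", sentence)
    tokens = re.findall(r"[^\W\d_]+", text)  # runs of Unicode letters
    return [(t, label_word(t)) for t in tokens]


if __name__ == "__main__":
    for token, label in label_sentence("Tēnā koe Mr Speaker, he kōrero to the House"):
        print(f"{token}\t{label}")
```

The sketch also shows why spelling rules alone fall short: English words such as "time" or "name" fit the Māori syllable pattern and would be mislabelled unless they appear in the lookalike list, which is consistent with the abstract's point that manual annotation was necessary.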
Aug-20-2022
- Country:
- Europe
- Spain > Catalonia
- Barcelona Province > Barcelona (0.04)
- Sweden > Uppsala County
- Uppsala (0.04)
- United Kingdom > England
- Cambridgeshire > Cambridge (0.14)
- Oceania
- Cook Islands (0.04)
- Kiribati (0.04)
- New Zealand > North Island
- Auckland Region > Auckland (0.05)
- Waikato (0.04)
- Wellington Region > Wellington (0.04)
- Pacific Ocean (0.04)
- Genre:
- Research Report (0.82)
- Industry:
- Government (0.68)
- Technology: