Crowdsourced Phrase-Based Tokenization for Low-Resourced Neural Machine Translation: The Case of Fon Language

Dossou, Bonaventure F. P., Emezue, Chris C.

Mar-17-2021–arXiv.org Artificial Intelligence

Building effective neural machine translation (NMT) models for very low-resourced and morphologically rich African indigenous languages is an open challenge. Besides the difficulty of finding available resources for them, much of the work goes into preprocessing and tokenization. Recent studies have shown that standard tokenization methods do not always adequately handle the grammatical, diacritical, and tonal properties of some African languages. That, coupled with the extremely low availability of training samples, hinders the production of reliable NMT models. In this paper, using the Fon language as a case study, we revisit standard tokenization methods and introduce Word-Expressions-Based (WEB) tokenization, a human-involved super-words tokenization strategy that creates a more representative vocabulary for training. Furthermore, we compare our tokenization strategy to others on the Fon-French and French-Fon translation tasks.
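To make the idea of "super-words" concrete, here is a minimal sketch of phrase-level tokenization. It is not the authors' WEB implementation, which the abstract does not specify. It assumes a hand-curated list of multi-word expressions (standing in for the human-involved step) and applies a greedy longest-match over whitespace-separated words. The function names `build_phrase_index` and `phrase_tokenize` and the English placeholder phrases are hypothetical, chosen only for illustration.

```python
from typing import Iterable, List, Set, Tuple

Phrase = Tuple[str, ...]


def build_phrase_index(phrases: Iterable[str]) -> Tuple[Set[Phrase], int]:
    """Turn curated expressions into word tuples and record the longest length.

    Assumption: annotators supply the phrases as plain strings.
    """
    index = {tuple(p.lower().split()) for p in phrases if p.strip()}
    max_len = max((len(p) for p in index), default=1)
    return index, max_len


def phrase_tokenize(sentence: str, index: Set[Phrase], max_len: int) -> List[str]:
    """Greedy longest-match: prefer the longest curated expression at each position."""
    words = sentence.lower().split()
    tokens: List[str] = []
    i = 0
    while i < len(words):
        # Try the longest window first, shrinking down to a single word.
        for n in range(min(max_len, len(words) - i), 0, -1):
            candidate = tuple(words[i:i + n])
            if n == 1 or candidate in index:
                tokens.append(" ".join(candidate))
                i += n
                break
    return tokens


# Placeholder expressions, not data from the paper.
index, max_len = build_phrase_index(["good morning", "how are you", "thank you"])
tokens = phrase_tokenize("Good morning how are you my friend", index, max_len)
print(tokens)
```

In this sketch, words outside the curated list fall back to single-word tokens, so the vocabulary mixes expressions and ordinary words. Lowercasing is a simplification that may not suit every orthography.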

expression, machine translation, translation, (13 more...)

Country:
- Oceania > Australia (0.04)
- North America > United States
  - Michigan (0.04)
  - Texas > Travis County
    - Austin (0.04)
  - Louisiana > Orleans Parish
    - New Orleans (0.04)
  - California > San Diego County
    - San Diego (0.04)
- Europe
  - Belgium (0.05)
  - Portugal > Lisbon
    - Lisbon (0.04)
  - Italy > Tuscany
    - Florence (0.04)
  - Germany
    - Berlin (0.05)
    - Bavaria > Upper Bavaria
      - Munich (0.04)
  - Denmark > Capital Region
    - Copenhagen (0.04)
- Asia
  - Indonesia > Bali (0.04)
  - China > Hong Kong (0.04)
  - Vietnam > Thái Nguyên Province
    - Thái Nguyên (0.04)
- Africa
  - Benin (0.04)
  - Togo (0.04)
  - Nigeria (0.04)
  - Ghana (0.04)

Genre:
- Research Report (1.00)

Technology:
- Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
