CantonMT: Cantonese to English NMT Platform with Fine-Tuned Models Using Synthetic Back-Translation Data

Hong, Kung Yin, Han, Lifeng, Batista-Navarro, Riza, Nenadic, Goran

Jun-9-2024–arXiv.org Artificial Intelligence

Neural Machine Translation (NMT) for low-resource languages remains a challenging task for NLP researchers. In this work, we apply a standard data augmentation methodology, back-translation, to a new translation direction: Cantonese-to-English. We present models, including OpusMT, NLLB, and mBART, that we fine-tuned using the limited amount of real data together with the synthetic data we generated via back-translation. We carried out automatic evaluation using a range of metrics, including lexical-based and embedding-based ones. Furthermore, we create a user-friendly interface for the models included in the CantonMT research project and make it available to facilitate Cantonese-to-English MT research. Researchers can add more models to this platform via our open-source CantonMT toolkit: https://github.com/kenrickkung/CantoneseTranslation.
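
The back-translation step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say which side's monolingual text was used or how real and synthetic data were combined, so this follows the common setup as an assumption: target-language (English) sentences are translated into the source language by a reverse model, and the resulting synthetic pairs are appended to the real parallel data. The names `back_translate`, `build_training_set`, and `toy_reverse_translate` are hypothetical and are not taken from the CantonMT toolkit; the toy translator is a stand-in for a real model such as OpusMT, NLLB, or mBART.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def back_translate(
    target_monolingual: List[str],
    reverse_translate: Callable[[str], str],
) -> List[Pair]:
    """Create synthetic (source, target) pairs from target-side text.

    Assumption: `reverse_translate` maps a target-language sentence
    to the source language (here, English to Cantonese).
    """
    pairs: List[Pair] = []
    for tgt in target_monolingual:
        tgt = tgt.strip()
        if not tgt:
            continue  # skip blank lines in the monolingual corpus
        synthetic_src = reverse_translate(tgt).strip()
        if synthetic_src:
            pairs.append((synthetic_src, tgt))
    return pairs


def build_training_set(real: List[Pair], synthetic: List[Pair]) -> List[Pair]:
    """Concatenate real and synthetic pairs, dropping exact duplicates."""
    seen = set()
    merged: List[Pair] = []
    for pair in real + synthetic:
        if pair not in seen:
            seen.add(pair)
            merged.append(pair)
    return merged


def toy_reverse_translate(sentence: str) -> str:
    """Hypothetical placeholder for a real reverse-direction NMT model."""
    return "<yue> " + sentence


real_pairs = [("你好", "Hello")]
synthetic_pairs = back_translate(
    ["Good morning", "   ", "Thank you"], toy_reverse_translate
)
training_set = build_training_set(real_pairs, synthetic_pairs)
```

In practice the placeholder would be replaced by a call to a trained reverse model, and the merged set would be used to fine-tune the Cantonese-to-English model.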

cantonese, synthetic data, translation, (13 more...)

Country:
- North America > United States
  - New York > New York County > New York City (0.04)
- Europe
  - United Kingdom (0.14)
  - France > Provence-Alpes-Côte d'Azur
    - Alpes-Maritimes > Nice (0.04)
  - Bulgaria > Sofia City Province
    - Sofia (0.04)
  - Belgium > Brussels-Capital Region
    - Brussels (0.04)
- Asia
  - Macao (0.14)
  - Middle East > UAE (0.04)
  - China
    - Hong Kong (0.05)
    - Guangdong Province > Guangzhou (0.04)
    - Beijing > Beijing (0.04)

Genre:
- Research Report (0.50)

Industry:
- Health & Medicine (0.68)

Technology:
- Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
