Low-Resource NMT: A Case Study on the Written and Spoken Languages in Hong Kong
arXiv.org Artificial Intelligence
The majority of inhabitants in Hong Kong are able to read and write in standard Chinese but use Cantonese as the primary spoken language in daily life. Spoken Cantonese can be transcribed into Chinese characters, which constitute the so-called written Cantonese. Written Cantonese exhibits significant lexical and grammatical differences from standard written Chinese. The rise of written Cantonese is increasingly evident in the cyber world. The growing interaction between Mandarin speakers and Cantonese speakers is leading to a clear demand for automatic translation between Chinese and Cantonese. This paper describes a transformer-based neural machine translation (NMT) system for written-Chinese-to-written-Cantonese translation. Given that parallel text data of Chinese and Cantonese are extremely scarce, a major focus of this study is on preparing a good amount of training data for NMT. In addition to collecting 28K parallel sentences from previous linguistic studies and scattered internet resources, we devise an effective approach to obtaining 72K parallel sentences by automatically extracting pairs of semantically similar sentences from parallel articles on Chinese Wikipedia and Cantonese Wikipedia. We show that leveraging highly similar sentence pairs mined from Wikipedia improves translation performance on all test sets. Our system outperforms Baidu Fanyi's Chinese-to-Cantonese translation on 6 out of 8 test sets in BLEU scores. Translation examples reveal that our system is able to capture important linguistic transformations between standard Chinese and spoken Cantonese.
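The abstract does not say how sentence similarity is measured, so the following is only a minimal sketch of the mining idea, not the authors' method. It assumes cosine similarity over character bigram counts as a stand-in for semantic similarity. The function names (`char_ngrams`, `cosine`, `mine_pairs`) and the `threshold` value are hypothetical choices made for illustration.

```python
from collections import Counter
from math import sqrt


def char_ngrams(sentence, n=2):
    """Count overlapping character n-grams (no word segmentation required)."""
    text = sentence.strip()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def mine_pairs(chinese_sentences, cantonese_sentences, threshold=0.5):
    """Pair each Chinese sentence with its best-scoring Cantonese sentence.

    A pair is kept only if its score reaches `threshold` (an assumed value,
    not one reported in the abstract).
    """
    cantonese_vectors = [char_ngrams(s) for s in cantonese_sentences]
    pairs = []
    for zh in chinese_sentences:
        zh_vec = char_ngrams(zh)
        scores = [cosine(zh_vec, vec) for vec in cantonese_vectors]
        if not scores:
            continue
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] >= threshold:
            pairs.append((zh, cantonese_sentences[best], scores[best]))
    return pairs
```

In this sketch, `mine_pairs` would be called once per pair of parallel articles, with each article already split into sentences. A surface-overlap score like this one favors sentence pairs that share many characters, so a real pipeline might prefer a learned sentence representation instead.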
May-26-2025
- Country:
  - Asia
    - China
      - Guangxi Province (0.04)
      - Hong Kong (1.00)
      - Macao (0.04)
    - Southeast Asia (0.04)
  - Europe > Switzerland (0.04)
  - North America > United States
    - New York > New York County > New York City (0.04)
- Genre:
  - Research Report (0.40)
- Industry:
  - Media (0.46)
- Technology: