High-Quality Data Augmentation for Low-Resource NMT: Combining a Translation Memory, a GAN Generator, and Filtering
Liu, Hengjie, Hou, Ruibo, Lepage, Yves
arXiv.org Artificial Intelligence
Back translation, a technique for extending a dataset, is widely used in low-resource translation tasks. It typically translates from the target language into the source language, which keeps the target side of each synthetic pair authentic and so helps ensure high-quality translation results. This paper proposes a novel way of using a monolingual corpus on the source side to assist Neural Machine Translation (NMT) in low-resource settings. We realize this concept with a Generative Adversarial Network (GAN), which augments the training data for the discriminator while limiting the interference of low-quality synthetic translations of the monolingual data with the generator. Additionally, this paper integrates Translation Memory (TM) with NMT, increasing the amount of data available to the generator. Finally, we propose a novel procedure for filtering the synthetic sentence pairs during augmentation to keep the quality of the data high.
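As a rough illustration of what a Translation Memory lookup can look like, the sketch below retrieves the stored sentence pair whose source side is most similar to a query sentence. The abstract does not say how the TM is integrated with NMT, so everything here is an assumption made for illustration: the function names `fuzzy_score` and `retrieve_from_tm`, the token-level similarity from Python's `difflib`, the 0.5 threshold, and the toy English-French memory are not taken from the paper.

```python
from difflib import SequenceMatcher


def fuzzy_score(a: str, b: str) -> float:
    """Token-level similarity in [0, 1] between two sentences (illustrative metric)."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()


def retrieve_from_tm(query, memory, threshold=0.5):
    """Return (score, source, target) for the best TM match, or None.

    `memory` is a list of (source, target) pairs. The threshold is a
    hypothetical cut-off, not a value from the paper.
    """
    best = None
    for src, tgt in memory:
        score = fuzzy_score(query, src)
        if best is None or score > best[0]:
            best = (score, src, tgt)
    if best is not None and best[0] >= threshold:
        return best
    return None


# Toy translation memory (assumed data, for demonstration only).
tm = [
    ("the cat sits on the mat", "le chat est assis sur le tapis"),
    ("i like green tea", "j'aime le thé vert"),
]

match = retrieve_from_tm("the cat sits on the sofa", tm)
if match is not None:
    score, src, tgt = match
    print(f"{score:.2f}\t{src}\t{tgt}")
```

A retrieved pair like this could, for example, be supplied to a translation model as extra context or added to its training data. Which of these the authors do is not stated in the abstract.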
Aug-21-2024