Efficient Machine Translation Corpus Generation

Yuksel, Kamer Ali, Gunduz, Ahmet, Sharma, Shreyas, Sawaf, Hassan

arXiv.org Artificial Intelligence 

Improving machine translation (MT) models requires continuously expanding their training corpora for re-training cycles by post-editing their outputs on samples received from the production environment. The MT model lifecycle therefore requires continuous human effort, which could scale better and become more efficient if it were semi-automated with machine-learning models trained by linguists. Those models can select the most useful set of translations to store and post-edit by identifying what is challenging for an MT model. They can upsample and prioritize translation outputs from areas where the MT model performs poorly, and reduce costs by choosing which production translations to post-edit. The continuous and interactive nature of the MT lifecycle makes it well suited to active-learning techniques for training those models. Custom translation quality or post-editing effort estimation models, trained on the fly as linguists post-edit translations, can be used to prioritize samples accumulating from model inference in the production environment. The trained estimators make it possible to focus linguist effort on the samples that are most challenging for the MT model and require the most post-edits, which are also the most valuable for humans to check when evaluating MT model quality.

In the WMT20 Metrics Shared Task (Mathur et al., 2020), participants were asked to score MT outputs from the WMT20 News Translation Task with automatic metrics, and four referenceless metrics were submitted. Those metrics (OpenKiwi-BERT, OpenKiwi-XLMR, YiSi-2, COMET-QE) use bilingual mappings of the contextual embeddings extracted from pre-trained or fine-tuned language models (such as XLM-RoBERTa) to evaluate the cross-lingual lexical semantic similarity between the input and the MT output.
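The prioritization idea described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that post-editing effort is approximated by a token-level edit distance between the MT output and its post-edited version (a simplified HTER-like label without shift operations), and that an effort estimator is available as a callable. The names `post_edit_effort`, `select_for_post_editing`, and `length_mismatch_estimator` are hypothetical, and the length-based estimator is only a toy placeholder for a model trained on linguists' post-edits.

```python
from typing import Callable, List, Sequence, Tuple


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Token-level Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # delete tok_a
                            curr[j - 1] + 1,      # insert tok_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def post_edit_effort(mt_output: str, post_edited: str) -> float:
    """Edits per post-edited token: an assumed, simplified effort label."""
    mt_tokens, pe_tokens = mt_output.split(), post_edited.split()
    if not pe_tokens:
        return 0.0 if not mt_tokens else 1.0
    return edit_distance(mt_tokens, pe_tokens) / len(pe_tokens)


def length_mismatch_estimator(source: str, mt_output: str) -> float:
    """Toy placeholder for a trained estimator (assumption, not a real model)."""
    n_src, n_mt = len(source.split()), len(mt_output.split())
    return abs(n_src - n_mt) / max(n_src, 1)


def select_for_post_editing(
    pool: List[Tuple[str, str]],
    estimator: Callable[[str, str], float],
    budget: int,
) -> List[Tuple[float, str, str]]:
    """Return the `budget` (score, source, mt) triples with highest predicted effort."""
    scored = [(estimator(src, mt), src, mt) for src, mt in pool]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:budget]
```

In an active-learning loop, the samples returned by `select_for_post_editing` would be sent to linguists, the resulting `post_edit_effort` values would serve as new training labels for the estimator, and the re-trained estimator would then re-rank the remaining pool.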
