Goto

Collaborating Authors

 Machine Translation


Effective General-Domain Data Inclusion for the Machine Translation Task by Vanilla Transformers

arXiv.org Artificial Intelligence

One of the vital breakthroughs in the history of machine translation is the development of the Transformer model. Not only it is revolutionary for various translation tasks, but also for a majority of other NLP tasks. In this paper, we aim at a Transformer-based system that is able to translate a source sentence in German to its counterpart target sentence in English. We perform the experiments on the news commentary German-English parallel sentences from the WMT'13 dataset. In addition, we investigate the effect of the inclusion of additional general-domain data in training from the IWSLT'16 dataset to improve the Transformer model performance. We find that including the IWSLT'16 dataset in training helps achieve a gain of 2 BLEU score points on the test set of the WMT'13 dataset. Qualitative analysis is introduced to analyze how the usage of general-domain data helps improve the quality of the produced translation sentences.


From Zero to Production: Baltic-Ukrainian Machine Translation Systems to Aid Refugees

arXiv.org Artificial Intelligence

In this paper, we examine the development and usage of six low-resource machine translation systems translating between the Ukrainian language and each of the official languages of the Baltic states. We developed these systems in reaction to the escalating Ukrainian refugee crisis caused by the Russian military aggression in Ukraine in the hope that they might be helpful for refugees and public administrations. Now, two months after MT systems were made public, we analyze their usage patterns and statistics. Our findings show that the Latvian-Ukrainian and Lithuanian-Ukrainian systems are integrated into the public services of Baltic states, leading to more than 127 million translated sentences for the Lithuanian-Ukrainian system. Motivated by these findings, we further enhance our MT systems by better Ukrainian toponym translation and publish an improved version of the Lithuanian-Ukrainian system.


Wukong: A 100 Million Large-scale Chinese Cross-modal Pre-training Benchmark

arXiv.org Artificial Intelligence

Vision-Language Pre-training (VLP) models have shown remarkable performance on various downstream tasks. Their success heavily relies on the scale of pre-trained cross-modal datasets. However, the lack of large-scale datasets and benchmarks in Chinese hinders the development of Chinese VLP models and broader multilingual applications. In this work, we release a large-scale Chinese cross-modal dataset named Wukong, which contains 100 million Chinese image-text pairs collected from the web. Wukong aims to benchmark different multi-modal pre-training methods to facilitate the VLP research and community development. Furthermore, we release a group of models pre-trained with various image encoders (ViT-B/ViT-L/SwinT) and also apply advanced pre-training techniques into VLP such as locked-image text tuning, token-wise similarity in contrastive learning, and reduced-token interaction. Extensive experiments and a benchmarking of different downstream tasks including a new largest human-verified image-text test dataset are also provided. Experiments show that Wukong can serve as a promising Chinese pre-training dataset and benchmark for different cross-modal learning methods. For the zero-shot image classification task on 10 datasets, $Wukong_{ViT-L}$ achieves an average accuracy of 73.03%. For the image-text retrieval task, it achieves a mean recall of 71.6% on AIC-ICC which is 12.9% higher than WenLan 2.0. Also, our Wukong models are benchmarked on downstream tasks with other variants on multiple datasets, e.g., Flickr8K-CN, Flickr-30K-CN, COCO-CN, et al. More information can be referred to: https://wukong-dataset.github.io/wukong-dataset/.


Representing 'how you say' with 'what you say': English corpus of focused speech and text reflecting corresponding implications

arXiv.org Artificial Intelligence

In speech communication, how something is said (paralinguistic information) is as crucial as what is said (linguistic information). As a type of paralinguistic information, English speech uses sentence stress, the heaviest prominence within a sentence, to convey emphasis. While different placements of sentence stress communicate different emphatic implications, current speech translation systems return the same translations if the utterances are linguistically identical, losing paralinguistic information. Concentrating on focus, a type of emphasis, we propose mapping paralinguistic information into the linguistic domain within the source language using lexical and grammatical devices. This method enables us to translate the paraphrased text representations instead of the transcription of the original speech and obtain translations that preserve paralinguistic information. As a first step, we present the collection of an English corpus containing speech that differed in the placement of focus along with the corresponding text, which was designed to reflect the implied meaning of the speech. Also, analyses of our corpus demonstrated that mapping of focus from the paralinguistic domain into the linguistic domain involved various lexical and grammatical methods. The data and insights from our analysis will further advance research into paralinguistic translation. The corpus will be published via LDC and our website.


Google Translate: How To Use Your Smartphone Camera To Translate Texts? - AI Magazine

#artificialintelligence

Difficult to do without Google Translate. Whether it is to translate a word, a sentence or an entire text, the tool developed by the American firm and launched in 2006 quickly became essential and one of Google's most used tools. But did you know that it is no longer necessary to type anything in the search bar or in the tool directly? Indeed, thanks to its numerous technological advances, Google now allows us to simply draw the camera of our smartphone. We don't want to offend you by explaining what Google Translate is, its usefulness is directly stated in its name.


Embarrassingly Easy Document-Level MT Metrics: How to Convert Any Pretrained Metric Into a Document-Level Metric

arXiv.org Artificial Intelligence

We hypothesize that existing sentence-level machine translation (MT) metrics become less effective when the human reference contains ambiguities. To verify this hypothesis, we present a very simple method for extending pretrained metrics to incorporate context at the document level. We apply our method to three popular metrics, BERTScore, Prism, and COMET, and to the reference free metric COMET-QE. We evaluate the extended metrics on the WMT 2021 metrics shared task using the provided MQM annotations. Our results show that the extended metrics outperform their sentence-level counterparts in about 85% of the tested conditions, when excluding results on low-quality human references. Additionally, we show that our document-level extension of COMET-QE dramatically improves its accuracy on discourse phenomena tasks, outperforming a dedicated baseline by up to 6.1%. Our experimental results support our initial hypothesis and show that a simple extension of the metrics permits them to take advantage of context to resolve ambiguities in the reference.


Facilitating Global Team Meetings Between Language-Based Subgroups: When and How Can Machine Translation Help?

arXiv.org Artificial Intelligence

Global teams frequently consist of language-based subgroups who put together complementary information to achieve common goals. Previous research outlines a two-step work communication flow in these teams. There are team meetings using a required common language (i.e., English); in preparation for those meetings, people have subgroup conversations in their native languages. Work communication at team meetings is often less effective than in subgroup conversations. In the current study, we investigate the idea of leveraging machine translation (MT) to facilitate global team meetings. We hypothesize that exchanging subgroup conversation logs before a team meeting offers contextual information that benefits teamwork at the meeting. MT can translate these logs, which enables comprehension at a low cost. To test our hypothesis, we conducted a between-subjects experiment where twenty quartets of participants performed a personnel selection task. Each quartet included two English native speakers (NS) and two non-native speakers (NNS) whose native language was Mandarin. All participants began the task with subgroup conversations in their native languages, then proceeded to team meetings in English. We manipulated the exchange of subgroup conversation logs prior to team meetings: with MT-mediated exchanges versus without. Analysis of participants' subjective experience, task performance, and depth of discussions as reflected through their conversational moves jointly indicates that team meeting quality improved when there were MT-mediated exchanges of subgroup conversation logs as opposed to no exchanges. We conclude with reflections on when and how MT could be applied to enhance global teamwork across a language barrier.


Improving Multilingual Neural Machine Translation System for Indic Languages

arXiv.org Artificial Intelligence

Machine Translation System (MTS) serves as an effective tool for communication by translating text or speech from one language to another language. The need of an efficient translation system becomes obvious in a large multilingual environment like India, where English and a set of Indian Languages (ILs) are officially used. In contrast with English, ILs are still entreated as low-resource languages due to unavailability of corpora. In order to address such asymmetric nature, multilingual neural machine translation (MNMT) system evolves as an ideal approach in this direction. In this paper, we propose a MNMT system to address the issues related to low-resource language translation. Our model comprises of two MNMT systems i.e. for English-Indic (one-to-many) and the other for Indic-English (many-to-one) with a shared encoder-decoder containing 15 language pairs (30 translation directions). Since most of IL pairs have scanty amount of parallel corpora, not sufficient for training any machine translation model. We explore various augmentation strategies to improve overall translation quality through the proposed model. A state-of-the-art transformer architecture is used to realize the proposed model. Trials over a good amount of data reveal its superiority over the conventional models. In addition, the paper addresses the use of language relationships (in terms of dialect, script, etc.), particularly about the role of high-resource languages of the same family in boosting the performance of low-resource languages. Moreover, the experimental results also show the advantage of backtranslation and domain adaptation for ILs to enhance the translation quality of both source and target languages. Using all these key approaches, our proposed model emerges to be more efficient than the baseline model in terms of evaluation metrics i.e BLEU (BiLingual Evaluation Understudy) score for a set of ILs.


Identifying Weaknesses in Machine Translation Metrics Through Minimum Bayes Risk Decoding: A Case Study for COMET

arXiv.org Artificial Intelligence

Neural metrics have achieved impressive correlation with human judgements in the evaluation of machine translation systems, but before we can safely optimise towards such metrics, we should be aware of (and ideally eliminate) biases toward bad translations that receive high scores. Our experiments show that sample-based Minimum Bayes Risk decoding can be used to explore and quantify such weaknesses. When applying this strategy to COMET for en-de and de-en, we find that COMET models are not sensitive enough to discrepancies in numbers and named entities. We further show that these biases are hard to fully remove by simply training on additional synthetic data and release our code and data for facilitating further experiments.


Informative Text Generation from Knowledge Triples

arXiv.org Artificial Intelligence

As the development of the encoder-decoder architecture, researchers are able to study the text generation tasks with broader types of data. Among them, KB-to-text aims at converting a set of knowledge triples into human readable sentences. In the original setting, the task assumes that the input triples and the text are exactly aligned in the perspective of the embodied knowledge/information. In this paper, we extend this setting and explore how to facilitate the trained model to generate more informative text, namely, containing more information about the triple entities but not conveyed by the input triples. To solve this problem, we propose a novel memory augmented generator that employs a memory network to memorize the useful knowledge learned during the training and utilizes such information together with the input triples to generate text in the operational or testing phase. We derive a dataset from WebNLG for our new setting and conduct extensive experiments to investigate the effectiveness of our model as well as uncover the intrinsic characteristics of the setting.