Machine Translation
Improving Robustness of Task Oriented Dialog Systems
Einolghozati, Arash, Gupta, Sonal, Mohit, Mrinal, Shah, Rushin
Task-oriented language understanding in dialog systems is often modeled using intents (task of a query) and slots (parameters for that task). Intent detection and slot tagging are, in turn, modeled using sentence classification and word tagging techniques respectively. Similar to adversarial attack problems with computer vision models discussed in existing literature, these intent-slot tagging models are often over-sensitive to small variations in input -- predicting different and often incorrect labels when small changes are made to a query, thus reducing their accuracy and reliability. However, evaluating a model's robustness to these changes is harder for language since words are discrete and an automated change (e.g. adding 'noise') to a query sometimes changes the meaning and thus labels of a query. In this paper, we first describe how to create an adversarial test set to measure the robustness of these models. Furthermore, we introduce and adapt adversarial training methods as well as data augmentation using back-translation to mitigate these issues. Our experiments show that both techniques improve the robustness of the system substantially and can be combined to yield the best results.
Syntax-Infused Transformer and BERT models for Machine Translation and Natural Language Understanding
Sundararaman, Dhanasekar, Subramanian, Vivek, Wang, Guoyin, Si, Shijing, Shen, Dinghan, Wang, Dong, Carin, Lawrence
Attention-based models have shown significant improvement over traditional algorithms in several NLP tasks. The Transformer, for instance, is an illustrative example that generates abstract representations of tokens input to an encoder based on their relationships to all tokens in a sequence. Recent studies have shown that although such models are capable of learning syntactic features purely by seeing examples, explicitly feeding this information to deep learning models can significantly enhance their performance. Leveraging syntactic information like part of speech (POS) may be particularly beneficial in limited training data settings for complex models such as the Transformer. We show that the syntax-infused Transformer with multiple features achieves an improvement of 0.7 BLEU when trained on the full WMT '14 English to German translation dataset and a maximum improvement of 1.99 BLEU points when trained on a fraction of the dataset. In addition, we find that the incorporation of syntax into BERT fine-tuning outperforms the baseline on a number of downstream tasks from the GLUE benchmark.
Introduction
Attention-based deep learning models for natural language processing (NLP) have shown promise for a variety of machine translation and natural language understanding tasks. For word-level, sequence-to-sequence tasks such as translation, paraphrasing, and text summarization, attention-based models allow a single token (e.g., a word or subword) in a sequence to be represented as a combination of all tokens in the sequence (Luong, Pham, and Manning, 2015). The distributed context allows attention-based models to infer rich representations for tokens, leading to more robust performance.
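The idea that a token is "represented as a combination of all tokens in the sequence" can be sketched in a few lines. The following is a minimal illustration only, not the authors' model: it uses scaled dot-product self-attention with identity projections, and it assumes that POS information is injected by concatenating a POS embedding onto each word vector. The function names, the toy vectors, and the `pos_table` values are all hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def self_attention(vectors):
    """Scaled dot-product self-attention with identity projections.

    Each output vector is a weighted average of all input vectors,
    with weights given by a softmax over scaled dot products.
    """
    scale = math.sqrt(len(vectors[0]))
    outputs = []
    for query in vectors:
        weights = softmax([dot(query, key) / scale for key in vectors])
        mixed = [sum(w * vec[i] for w, vec in zip(weights, vectors))
                 for i in range(len(query))]
        outputs.append(mixed)
    return outputs

def infuse_pos(word_vectors, pos_tags, pos_table):
    """Assumed infusion scheme: concatenate a POS embedding per token."""
    return [vec + pos_table[tag] for vec, tag in zip(word_vectors, pos_tags)]

# Toy inputs: 2-d word vectors and 2-d POS embeddings (values invented).
words = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tags = ["DET", "NOUN", "VERB"]
pos_table = {"DET": [0.1, 0.0], "NOUN": [0.0, 0.1], "VERB": [0.1, 0.1]}

infused = infuse_pos(words, tags, pos_table)
contextual = self_attention(infused)
```

Because the attention weights for each token sum to one, every coordinate of an output vector lies between the smallest and largest value of that coordinate across the inputs. A real Transformer adds learned query, key, and value projections, multiple heads, and positional information, none of which are modeled here.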
A Massive Collection of Cross-Lingual Web-Document Pairs
El-Kishky, Ahmed, Chaudhary, Vishrav, Guzman, Francisco, Koehn, Philipp
Cross-lingual document alignment aims to identify pairs of documents in two distinct languages that are of comparable content or translations of each other. Small-scale efforts have been made to collect aligned document-level data on a limited set of language pairs such as English-German or on limited comparable collections such as Wikipedia. In this paper, we mine twelve snapshots of the Common Crawl corpus and identify web document pairs that are translations of each other. We release a new web dataset consisting of 54 million URL pairs from Common Crawl covering documents in 92 languages paired with English. We evaluate the quality of the dataset by measuring the quality of machine translations from models that have been trained on mined parallel sentence pairs from this aligned corpus and introduce a simple yet effective baseline for identifying these aligned documents. The objective of this dataset and paper is to foster new research in cross-lingual NLP across a variety of low, mid, and high-resource languages.
Modelling Bahdanau Attention using Election methods aided by Q-Learning
Neural Machine Translation has lately gained a lot of "attention" with the advent of increasingly sophisticated and drastically improved models. The attention mechanism has proved to be a boon in this direction by providing weights to the input words, making it easy for the decoder to identify words representing the present context. However, as newer and more complex attention models were developed, they required more computation, making inference slow. In this paper, we model the attention network using techniques from social choice theory. In addition, the attention mechanism, being a Markov Decision Process, is represented using reinforcement learning techniques. Thus, we propose to use an election method (k-Borda), fine-tuned using Q-learning, as a replacement for attention networks. The inference time for this network is less than that of a standard Bahdanau translator, and the translation results are comparable. This not only experimentally verifies the claims stated above but also provides faster inference.
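For readers unfamiliar with the election method, the generic k-Borda rule can be sketched as follows. This shows only the standard committee-selection rule, in which each ballot ranks all candidates and the k candidates with the highest total Borda score win. How the paper maps source words to candidates and what plays the role of voters is not stated in the abstract, so the ballots below are purely hypothetical, and the Q-learning fine-tuning step is not modeled.

```python
from collections import defaultdict

def k_borda(rankings, k):
    """Return the k candidates with the highest total Borda score.

    rankings: list of ballots, each a list of candidates ordered from
    most to least preferred. With m candidates on a ballot, the
    candidate at 0-based position i earns m - 1 - i points.
    """
    scores = defaultdict(int)
    for ballot in rankings:
        m = len(ballot)
        for position, candidate in enumerate(ballot):
            scores[candidate] += m - 1 - position
    # Sort by score (descending), breaking ties by candidate name.
    ordered = sorted(scores, key=lambda c: (-scores[c], c))
    return ordered[:k], dict(scores)

# Hypothetical ballots: three "voters" ranking three source words.
ballots = [
    ["the", "cat", "sat"],
    ["cat", "sat", "the"],
    ["cat", "the", "sat"],
]
committee, scores = k_borda(ballots, k=2)
print(committee)  # ['cat', 'the']
print(scores)     # {'the': 3, 'cat': 5, 'sat': 1}
```

The rule needs only integer additions and one sort per decoding step, which is consistent with the abstract's claim of cheaper inference, although the actual cost depends on details the abstract does not give.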
Instance-based Transfer Learning for Multilingual Deep Retrieval
Arnold, Andrew O., Cohen, William W.
Perhaps the simplest type of multilingual transfer learning is instance-based transfer learning, in which data from the target language and the auxiliary languages are pooled, and a single model is learned from the pooled data. It is not immediately obvious when instance-based transfer learning will improve performance in this multilingual setting: for instance, a plausible conjecture is that this kind of transfer learning would help only if the auxiliary languages were very similar to the target. Here we show that at large scale, this method is surprisingly effective, leading to positive transfer on all 35 target languages we tested. We analyze this improvement and argue that the most natural explanation, namely direct vocabulary overlap between languages, only partially explains the performance gains: in fact, we demonstrate that target-language improvement can occur after adding data from an auxiliary language with no vocabulary in common with the target. This surprising result is due to the effect of transitive vocabulary overlaps between pairs of auxiliary and target languages.
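The distinction between direct and transitive vocabulary overlap can be made concrete with a toy example. The sketch below is one possible reading of "transitive overlap", under the assumption that a third language sharing tokens with both the target and the auxiliary language acts as a bridge. All vocabularies, language names, and function names are invented for illustration.

```python
def direct_overlap(vocab_a, vocab_b):
    """Tokens that two vocabularies share directly."""
    return vocab_a & vocab_b

def bridging_languages(target, auxiliary, others):
    """Names of languages sharing tokens with both target and auxiliary."""
    return sorted(
        name for name, vocab in others.items()
        if vocab & target and vocab & auxiliary
    )

def pool(*datasets):
    """Instance-based transfer: concatenate all training examples."""
    return [example for data in datasets for example in data]

# Toy vocabularies (invented for illustration).
target_vocab = {"alpha", "beta"}
aux_vocab = {"gamma", "delta"}
other_vocabs = {"bridge": {"beta", "gamma"}, "unrelated": {"omega"}}

shared = direct_overlap(target_vocab, aux_vocab)          # empty set
bridges = bridging_languages(target_vocab, aux_vocab, other_vocabs)
pooled = pool(
    [("alpha beta", "doc1")],
    [("gamma delta", "doc2")],
    [("beta gamma", "doc3")],
)
```

Here the target and auxiliary vocabularies are disjoint, yet the language labeled `bridge` links them, so a single model trained on `pooled` could in principle relate `gamma` to `beta` and hence to the target language.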
Biconditional Generative Adversarial Networks for Multiview Learning with Missing Views
Doinychko, Anastasiia, Amini, Massih-Reza
In this paper, we present a conditional GAN with two generators and a common discriminator for multiview learning problems where observations have two views, but one of them may be missing for some of the training samples. This is, for example, the case for multilingual collections where documents are not available in all languages. Some studies tackled this problem by assuming the existence of view generation functions to approximately complete the missing views; for example, Machine Translation to translate documents into the missing languages. These functions generally require an external resource to be set up, and their quality has a direct impact on the performance of the learned multiview classifier over the completed training set. Our proposed approach addresses this problem by jointly learning the missing views and the multiview classifier using a tripartite game with two generators and a discriminator. Each of the generators is associated with one of the views and tries to fool the discriminator by generating the other missing view conditionally on the corresponding observed view. The discriminator then tries to identify if for an observation, one of its views is completed by one of the generators or if both views are completed along with its class. Our results on a subset of Reuters RCV1/RCV2 collections show that the discriminator achieves significant classification performance, and that the generators learn the missing views with high quality without the need for any substantial external resource.
Can Neural Networks Learn Symbolic Rewriting?
Piotrowski, Bartosz, Urban, Josef, Brown, Chad E., Kaliszyk, Cezary
This work investigates whether current neural architectures are adequate for learning symbolic rewriting. Two kinds of data sets are proposed for this research: one based on automated proofs and the other a synthetic set of polynomial terms. Experiments using current neural machine translation models are performed and their results are discussed. Ideas for extending this line of research are proposed and its relevance is motivated.
Google's New AI Milestone: Neural Machine Translation Engine Can Now Translate 103 Languages
Neural Machine Translation (NMT), one of the most important topics in deep learning, has gained much attention from industry and academia over the last few years. In order to create simple models out of complex ones, tech giant Google has been innovating in the domain of human-to-machine and machine-to-human translation for several years. Back in 2017, the company introduced a way to use a single Neural Machine Translation (NMT) model to translate between multiple languages, in which the researchers merged 12 language pairs into a single model. The models fall into three types: many-to-one, one-to-many and many-to-many. Recently, researchers at the Google AI team built an enhanced system for neural machine translation (NMT) and published a paper titled "Massively Multilingual Neural Machine Translation in the Wild: Findings and Challenges".
Microsoft Research Asia's Systems for WMT19
Xia, Yingce, Tan, Xu, Tian, Fei, Gao, Fei, Chen, Weicong, Fan, Yang, Gong, Linyuan, Leng, Yichong, Luo, Renqian, Wang, Yiren, Wu, Lijun, Zhu, Jinhua, Qin, Tao, Liu, Tie-Yan
We, Microsoft Research Asia, made submissions to 11 language directions in the WMT19 news translation tasks. We won first place in 8 of the 11 directions and second place in the other three. Our basic systems are built on Transformer, back translation and knowledge distillation. We integrate several of our recent techniques to enhance the baseline systems: multi-agent dual learning (MADL), masked sequence-to-sequence pre-training (MASS), neural architecture optimization (NAO), and soft contextual data augmentation (SCA).
1 Introduction
We participated in the WMT19 shared news translation task in 11 translation directions. We achieved first place for 8 directions: German↔English, German↔French, Chinese→English, English→Lithuanian, English→Finnish, and Russian→English, and three other directions were placed second (ranked by teams), which included Lithuanian→English, Finnish→English, and English→Kazakh. Our basic systems are based on Transformer, back translation and knowledge distillation. We experimented with several techniques we proposed recently. In brief, the innovations we introduced are:
Multi-agent dual learning (MADL): The core idea of dual learning is to leverage the duality between the primal task (mapping from domain X to domain Y) and the dual task (mapping from domain Y to X) to boost the performance of both tasks. MADL (Wang et al., 2019) extends the dual learning (He et al., 2016; Xia et al., 2017a) framework by introducing multiple primal and dual models. It was integrated into our submitted systems for [...]
Domain, Translationese and Noise in Synthetic Data for Neural Machine Translation
Bogoychev, Nikolay, Sennrich, Rico
The quality of neural machine translation can be improved by leveraging additional monolingual resources to create synthetic training data. Source-side monolingual data can be (forward-)translated into the target language for self-training; target-side monolingual data can be back-translated. It has been widely reported that back-translation delivers superior results, but could this be due to artefacts in the test sets? We perform a case study using the French-English news translation task and separate test sets based on their original languages. We show that forward translation delivers superior gains in terms of BLEU on sentences that were originally in the source language, complementing previous studies which show large improvements with back-translation on sentences that were originally in the target language. To better understand when and why forward and back-translation are effective, we study the role of domains, translationese, and noise. While translationese effects are well known to influence MT evaluation, we also find evidence that news data from different languages shows subtle domain differences, which is another explanation for varying performance on different portions of the test set. We perform additional low-resource experiments which demonstrate that forward translation is more sensitive to the quality of the initial translation system than back-translation, and tends to perform worse in low-resource settings.
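The difference between the two kinds of synthetic data comes down to which side of each training pair is machine-generated. The sketch below illustrates this with toy word-for-word "translators" standing in for real NMT systems. The dictionaries, function names, and sentences are all hypothetical and only serve to show how the pairs are assembled.

```python
def make_translator(table):
    """Build a toy word-by-word translator from a lookup table.

    This stands in for a trained NMT system; unknown words pass through.
    """
    def translate(sentence):
        return " ".join(table.get(word, word) for word in sentence.split())
    return translate

def forward_translate(source_sentences, src_to_tgt):
    """Self-training data: real source paired with synthetic target."""
    return [(src, src_to_tgt(src)) for src in source_sentences]

def back_translate(target_sentences, tgt_to_src):
    """Back-translation data: synthetic source paired with real target."""
    return [(tgt_to_src(tgt), tgt) for tgt in target_sentences]

# Toy French-English dictionaries (invented for illustration).
FR_EN = {"le": "the", "chat": "cat", "dort": "sleeps"}
EN_FR = {en: fr for fr, en in FR_EN.items()}

fr_to_en = make_translator(FR_EN)
en_to_fr = make_translator(EN_FR)

forward_pairs = forward_translate(["le chat dort"], fr_to_en)
back_pairs = back_translate(["the cat sleeps"], en_to_fr)
```

In both cases the resulting (source, target) pairs are added to the genuine parallel data for French-to-English training. With forward translation, errors of the initial system end up on the target side that the model learns to produce, which fits the abstract's observation that forward translation is more sensitive to the quality of the initial system.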