Machine Translation
Weakly Supervised Grammatical Error Correction using Iterative Decoding
Lichtarge, Jared, Alberti, Christopher, Kumar, Shankar, Shazeer, Noam, Parmar, Niki
We describe an approach to Grammatical Error Correction (GEC) that is effective at making use of models trained on large amounts of weakly supervised bitext. We train the Transformer sequence-to-sequence model on 4B tokens of Wikipedia revisions and employ an iterative decoding strategy that is tailored to the loosely-supervised nature of the Wikipedia training corpus. Finetuning on the Lang-8 corpus and ensembling yields an F0.5 of 58.3 on the CoNLL'14 benchmark and a GLEU of 62.4 on JFLEG. The combination of weakly supervised training and iterative decoding obtains an F0.5 of 48.2 on CoNLL'14 even without using any labeled GEC data.
Area Attention
Li, Yang, Kaiser, Lukasz, Bengio, Samy, Si, Si
Existing attention mechanisms, are mostly item-based in that a model is designed to attend to a single item in a collection of items (the memory). Intuitively, an area in the memory that may contain multiple items can be worth attending to as a whole. We propose area attention: a way to attend to an area of the memory, where each area contains a group of items that are either spatially adjacent when the memory has a 2-dimensional structure, such as images, or temporally adjacent for 1-dimensional memory, such as natural language sentences. Importantly, the size of an area, i.e., the number of items in an area, can vary depending on the learned coherence of the adjacent items. By giving the model the option to attend to an area of items, instead of only a single item, we hope attention mechanisms can better capture the nature of the task. Area attention can work along multi-head attention for attending to multiple areas in the memory. We evaluate area attention on two tasks: neural machine translation and image captioning, and improve upon strong (state-of-the-art) baselines in both cases. These improvements are obtainable with a basic form of area attention that is parameter free. In addition to proposing the novel concept of area attention, we contribute an efficient way for computing it by leveraging the technique of summed area tables.
Learning to Teach with Dynamic Loss Functions
Wu, Lijun, Tian, Fei, Xia, Yingce, Fan, Yang, Qin, Tao, Lai, Jianhuang, Liu, Tie-Yan
Teaching is critical to human society: it is with teaching that prospective students are educated and human civilization can be inherited and advanced. A good teacher not only provides his/her students with qualified teaching materials (e.g., textbooks), but also sets up appropriate learning objectives (e.g., course projects and exams) considering different situations of a student. When it comes to artificial intelligence, treating machine learning models as students, the loss functions that are optimized act as perfect counterparts of the learning objective set by the teacher. In this work, we explore the possibility of imitating human teaching behaviors by dynamically and automatically outputting appropriate loss functions to train machine learning models. Different from typical learning settings in which the loss function of a machine learning model is predefined and fixed, in our framework, the loss function of a machine learning model (we call it student) is defined by another machine learning model (we call it teacher). The ultimate goal of teacher model is cultivating the student to have better performance measured on development dataset. Towards that end, similar to human teaching, the teacher, a parametric model, dynamically outputs different loss functions that will be used and optimized by its student model at different training stages. We develop an efficient learning method for the teacher model that makes gradient based optimization possible, exempt of the ineffective solutions such as policy optimization. We name our method as "learning to teach with dynamic loss functions" (L2T-DLF for short). Extensive experiments on real world tasks including image classification and neural machine translation demonstrate that our method significantly improves the quality of various student models.
Learning to Screen for Fast Softmax Inference on Large Vocabulary Neural Networks
Chen, Patrick H., Si, Si, Kumar, Sanjiv, Li, Yang, Hsieh, Cho-Jui
Neural language models have been widely used in various NLP tasks, including machine translation, next word prediction and conversational agents. However, it is challenging to deploy these models on mobile devices due to their slow prediction speed, where the bottleneck is to compute top candidates in the softmax layer. In this paper, we introduce a novel softmax layer approximation algorithm by exploiting the clustering structure of context vectors. Our algorithm uses a light-weight screening model to predict a much smaller set of candidate words based on the given context, and then conducts an exact softmax only within that subset. Training such a procedure end-to-end is challenging as traditional clustering methods are discrete and non-differentiable, and thus unable to be used with back-propagation in the training process. Using the Gumbel softmax, we are able to train the screening model end-to-end on the training set to exploit data distribution. The algorithm achieves an order of magnitude faster inference than the original softmax layer for predicting top-$k$ words in various tasks such as beam search in machine translation or next words prediction. For example, for machine translation task on German to English dataset with around 25K vocabulary, we can achieve 20.4 times speed up with 98.9\% precision@1 and 99.3\% precision@5 with the original softmax layer prediction, while state-of-the-art ~\citep{MSRprediction} only achieves 6.7x speedup with 98.7\% precision@1 and 98.1\% precision@5 for the same task.
Parallel Attention Mechanisms in Neural Machine Translation
Medina, Julian Richard, Kalita, Jugal
Abstract--Recent papers in neural machine translation have proposed the strict use of attention mechanisms over previous standards such as recurrent and convolutional neural networks (RNNs and CNNs). We propose that by running traditionally stacked encoding branches from encoder-decoder attentionfocused architectures in parallel, that even more sequential operations can be removed from the model and thereby decrease training time. In particular, we modify the recently published attention-based architecture called Transformer by Google, by replacing sequential attention modules with parallel ones, reducing the amount of training time and substantially improving BLEU scores at the same time. Experiments over the English to German and English to French translation tasks show that our model establishes a new state of the art. Historically, statistical machine translation involved extensive work in the alignment of words and phrases developed by linguistic experts working with computer scientists [1].
Bible is providing data to help create AI that can can convert texts
Scientists are now using the Bible to help algorithms perfect their language skills. An AI has been trained on various versions of the sacred text so it can convert written works into different styles for different audiences. Each version of the Bible contains more than 31,000 verses that the researchers used to produce over 1.5 million unique pairings of source and target verses. The Bible is helping algorithms perfect their translation skills. Internet tools that translate text between languages like English and Spanish are widely available.
Multi-Head Attention with Disagreement Regularization
Li, Jian, Tu, Zhaopeng, Yang, Baosong, Lyu, Michael R., Zhang, Tong
Multi-head attention is appealing for the ability to jointly attend to information from different representation subspaces at different positions. In this work, we introduce a disagreement regularization to explicitly encourage the diversity among multiple attention heads. Specifically, we propose three types of disagreement regularization, which respectively encourage the subspace, the attended positions, and the output representation associated with each attention head to be different from other heads. Experimental results on widely-used WMT14 English-German and WMT17 Chinese-English translation tasks demonstrate the effectiveness and universality of the proposed approach.
Modeling Localness for Self-Attention Networks
Yang, Baosong, Tu, Zhaopeng, Wong, Derek F., Meng, Fandong, Chao, Lidia S., Zhang, Tong
Self-attention networks have proven to be of profound value for its strength of capturing global dependencies. In this work, we propose to model localness for self-attention networks, which enhances the ability of capturing useful local context. We cast localness modeling as a learnable Gaussian bias, which indicates the central and scope of the local region to be paid more attention. The bias is then incorporated into the original attention distribution to form a revised distribution. To maintain the strength of capturing long distance dependencies and enhance the ability of capturing short-range dependencies, we only apply localness modeling to lower layers of self-attention networks. Quantitative and qualitative analyses on Chinese-English and English-German translation tasks demonstrate the effectiveness and universality of the proposed approach.
Exploiting Deep Representations for Neural Machine Translation
Dou, Zi-Yi, Tu, Zhaopeng, Wang, Xing, Shi, Shuming, Zhang, Tong
Advanced neural machine translation (NMT) models generally implement encoder and decoder as multiple layers, which allows systems to model complex functions and capture complicated linguistic structures. However, only the top layers of encoder and decoder are leveraged in the subsequent process, which misses the opportunity to exploit the useful information embedded in other layers. In this work, we propose to simultaneously expose all of these signals with layer aggregation and multi-layer attention mechanisms. In addition, we introduce an auxiliary regularization term to encourage different layers to capture diverse information. Experimental results on widely-used WMT14 English-German and WMT17 Chinese-English translation data demonstrate the effectiveness and universality of the proposed approach.
Baidu's AI Can Do Simultaneous Translation Between Any Two Languages
Would-be travelers of the galaxy, rejoice: The Chinese tech giant Baidu has invented a translation system that brings us one step closer to a software Babel fish. For those unfamiliar with the Douglas Adams masterworks of science fiction, let me explain. The Babel fish is a slithery fictional creature that takes up residence in the ear canal of humans, tapping into their neural systems to provide instant translation of any language they hear. In the real world, until now, we've had to make do with human and software interpreters that do their best to keep up. But the new AI-powered tool from Baidu Research, called STACL, could speed things up considerably.