Machine Translation
Translating and Evolving: Towards a Model of Language Change in DisCoCat
Bradley, Tai-Danae, Lewis, Martha, Master, Jade, Theilman, Brad
The categorical compositional distributional (DisCoCat) model of meaning developed by Coecke et al. (2010) has been successful in modeling various aspects of meaning. However, it fails to model the fact that language can change. We give an approach to DisCoCat that allows us to represent language models and translations between them, enabling us to describe translations from one language to another, or changes within the same language. We unify the product space representation given in (Coecke et al., 2010) and the functorial description in (Kartsaklis et al., 2013), in a way that allows us to view a language as a catalogue of meanings. We formalize the notion of a lexicon in DisCoCat, and define a dictionary of meanings between two lexicons. All this is done within the framework of monoidal categories. We give examples of how to apply our methods, and give a concrete suggestion for compositional translation in corpora.
Blockwise Parallel Decoding for Deep Autoregressive Models
Stern, Mitchell, Shazeer, Noam, Uszkoreit, Jakob
Deep autoregressive sequence-to-sequence models have demonstrated impressive performance across a wide variety of tasks in recent years. While common architecture classes such as recurrent, convolutional, and self-attention networks make different trade-offs between the amount of computation needed per layer and the length of the critical path at training time, generation remains an inherently sequential process. To overcome this limitation, we propose a novel blockwise parallel decoding scheme in which we make predictions for multiple time steps in parallel, then back off to the longest prefix validated by a scoring model. This allows for substantial theoretical improvements in generation speed when applied to architectures that can process output sequences in parallel. We verify our approach empirically through a series of experiments using state-of-the-art self-attention models for machine translation and image super-resolution, achieving iteration reductions of up to 2x over a baseline greedy decoder with no loss in quality, or up to 7x in exchange for a slight decrease in performance. In terms of wall-clock time, our fastest models exhibit real-time speedups of up to 4x over standard greedy decoding.
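The predict, verify, and accept loop described in this abstract can be illustrated with a minimal Python sketch. Everything named here is a hypothetical stand-in, not the authors' implementation: `base_next` plays the role of the scoring model's greedy choice, `propose` plays the role of the block of parallel predictions, and the toy `TARGET` sequence replaces a real network. Verification is written as a sequential loop for readability, whereas the scheme in the abstract assumes an architecture that can score all block positions in parallel.

```python
from typing import Callable, List, Sequence, Tuple

Token = int


def blockwise_parallel_decode(
    base_next: Callable[[Sequence[Token]], Token],
    propose: Callable[[Sequence[Token], int], List[Token]],
    prefix: Sequence[Token],
    block_size: int,
    eos: Token,
    max_len: int,
) -> Tuple[List[Token], int]:
    """Greedy decoding that accepts the longest validated prefix of each block.

    Returns the decoded sequence and the number of decoding iterations used.
    """
    out = list(prefix)
    iterations = 0
    while len(out) < max_len and (not out or out[-1] != eos):
        iterations += 1
        # Predict: propose up to block_size future tokens at once.
        block = propose(out, block_size)
        # Verify and accept: keep proposals while they match the greedy choice
        # of the scoring model (checked sequentially in this sketch).
        accepted = 0
        for tok in block:
            if base_next(out) != tok:
                break
            out.append(tok)
            accepted += 1
            if tok == eos or len(out) >= max_len:
                break
        # Fall back to one ordinary greedy step if nothing was accepted.
        if accepted == 0:
            out.append(base_next(out))
    return out, iterations


# Toy setup (assumption): the "model" simply reads off a fixed target sequence.
TARGET = [1, 2, 3, 4, 5, 6, 0]  # 0 is the end-of-sequence token


def toy_base_next(prefix: Sequence[Token]) -> Token:
    return TARGET[len(prefix)]


def toy_propose(prefix: Sequence[Token], k: int) -> List[Token]:
    # The first two proposals in each block are right, later ones are wrong.
    start = len(prefix)
    block = []
    for i in range(k):
        if start + i >= len(TARGET):
            break
        block.append(TARGET[start + i] if i < 2 else -1)
    return block


decoded, steps = blockwise_parallel_decode(
    toy_base_next, toy_propose, prefix=[], block_size=3, eos=0, max_len=len(TARGET)
)
print(decoded, steps)
```

In this toy run the output matches plain greedy decoding token for token, but it needs 4 iterations instead of 7, because two proposals per block survive verification.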
Artificial Intelligence Will Be the Greatest Jobs Engine the World Has Ever Seen
In the past few years, artificial intelligence has advanced so quickly that it now seems that hardly a month goes by without a newsworthy AI breakthrough. In areas as wide-ranging as speech translation, medical diagnosis and game play, we have seen computers outperform humans in startling ways. This has sparked a discussion about what impact AI will have on employment. Some fear that as AI improves, it will supplant workers in the job force, creating an ever-growing pool of unemployable humans who cannot economically compete with machines in any meaningful way. This concern, while understandable, is unfounded.
Neural Phrase-to-Phrase Machine Translation
Feng, Jiangtao, Kong, Lingpeng, Huang, Po-Sen, Wang, Chong, Huang, Da, Mao, Jiayuan, Qiao, Kan, Zhou, Dengyong
In recent years, we have witnessed the surge of neural sequence-to-sequence (seq2seq) models (Bahdanau et al., 2014; Sutskever et al., 2014). Architectures (Gehring et al., 2017) and training techniques (Vaswani et al., 2017; Ba et al., 2016) keep advancing. Until recently, Huang et al. (2018) developed Neural Phrase-based Machine Translation (NPMT). This work was done when Jiangtao and Jiayuan interned at Google. We use "···" to indicate all the possible segments. In our model, given the phrase-level attentions, we develop a dictionary lookup decoding method with an external phrase-to-phrase dictionary. We show how it avoids the more costly dynamic programming used in NPMT (Huang et al., 2018). For segment index n = 1, ...: (a) update the attention state given all previous segments. Similar to NPMT in Huang et al. (2018), directly computing Eq. (5) is intractable. We also need to develop a dynamic programming algorithm to efficiently compute the loss function.
Learning to Segment Inputs for NMT Favors Character-Level Processing
Kreutzer, Julia, Sokolov, Artem
Most modern neural machine translation (NMT) systems rely on presegmented inputs. Segmentation granularity importantly determines the input and output sequence lengths, hence the modeling depth, and source and target vocabularies, which in turn determine model size, computational costs of softmax normalization, and handling of out-of-vocabulary words. However, the current practice is to use static, heuristic-based segmentations that are fixed before NMT training. This raises the question of whether the chosen segmentation is optimal for the translation task. To overcome suboptimal segmentation choices, we present an algorithm for dynamic segmentation based on the Adaptive Computation Time algorithm (Graves, 2016) that is trainable end-to-end and driven by the NMT objective. In an evaluation on four translation tasks we found that, given the freedom to navigate between different segmentation levels, the model prefers to operate on (almost) character level, providing support for purely character-level NMT models from a novel angle.
Content preserving text generation with attribute controls
Logeswaran, Lajanugen, Lee, Honglak, Bengio, Samy
In this work, we address the problem of modifying textual attributes of sentences. Given an input sentence and a set of attribute labels, we attempt to generate sentences that are compatible with the conditioning information. To ensure that the model generates content-compatible sentences, we introduce a reconstruction loss which interpolates between auto-encoding and back-translation loss components. We propose an adversarial loss to encourage generated samples to be attribute-compatible and realistic. Through quantitative, qualitative and human evaluations we demonstrate that our model is capable of generating fluent sentences that better reflect the conditioning information compared to prior methods. We further demonstrate that the model is capable of simultaneously controlling multiple attributes.
Margin-based Parallel Corpus Mining with Multilingual Sentence Embeddings
Artetxe, Mikel, Schwenk, Holger
Machine translation is highly sensitive to the size and quality of the training data, which has led to an increasing interest in collecting and filtering large parallel corpora. In this paper, we propose a new method for this task based on multilingual sentence embeddings. Our approach uses an encoder-decoder trained over an initial parallel corpus to build multilingual sentence representations, which are then incorporated into a new margin-based method to score, mine and filter parallel sentences. In contrast to previous approaches, which rely on nearest neighbor retrieval with a hard threshold over cosine similarity, our proposed method accounts for the scale inconsistencies of this measure, considering the margin between a given sentence pair and its closest candidates instead. Our experiments show large improvements over existing methods. We outperform the best published results on the BUCC shared task on parallel corpus mining by more than 10 F1 points. We also improve the precision from 48.9 to 83.3 on the reconstruction of 11.3M English-French sentence pairs of the UN corpus. Finally, filtering the English-German ParaCrawl corpus with our approach, we obtain 31.2 BLEU points on newstest2014, an improvement of more than one point over the best official filtered version.
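To make the contrast with a hard cosine threshold concrete, the following Python sketch scores a candidate pair by the ratio between its cosine similarity and the average similarity of each sentence to its k nearest candidates on the other side. The abstract does not spell out the formula, so the ratio form, the function names, and the toy 3-dimensional "embeddings" are assumptions made for illustration. The sketch also assumes the averaged neighbour similarities are positive.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_top_k_sim(x: Vector, candidates: Sequence[Vector], k: int) -> float:
    """Average cosine similarity between x and its k most similar candidates."""
    sims = sorted((cosine(x, c) for c in candidates), reverse=True)[:k]
    return sum(sims) / len(sims)


def ratio_margin_score(
    x: Vector,
    y: Vector,
    sources: Sequence[Vector],
    targets: Sequence[Vector],
    k: int,
) -> float:
    """Cosine of (x, y) relative to the neighbourhood similarity of x and y.

    x is compared against its nearest targets, y against its nearest sources,
    so a pair only scores highly if it stands out from its closest candidates.
    """
    neighbourhood = (
        mean_top_k_sim(x, targets, k) + mean_top_k_sim(y, sources, k)
    ) / 2
    return cosine(x, y) / neighbourhood


# Toy data (assumption): s_hub lies in a dense region close to three targets,
# while s_good has a single clear match, t_d.
t_a, t_b, t_c, t_d = (1, 0, 0), (0.95, 0.31, 0), (0.95, -0.31, 0), (0, 0, 1)
s_hub, s_good = (1, 0.1, 0), (0.6, 0, 1)
sources = [s_hub, s_good]
targets = [t_a, t_b, t_c, t_d]

for name, (x, y) in {"hub/t_c": (s_hub, t_c), "good/t_d": (s_good, t_d)}.items():
    print(name, round(cosine(x, y), 3),
          round(ratio_margin_score(x, y, sources, targets, k=2), 3))
```

Raw cosine ranks the pair (s_hub, t_c) above (s_good, t_d), even though t_c is only the third-best target for s_hub. The margin-based score reverses that ranking, because s_hub is similar to everything around it and the pair therefore does not stand out.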
How AI Is Challenging Traditional Translators - DZone AI
In the last decade, translation services have grown exponentially to include hardware devices such as Travis Translator, earphones such as Waverly Labs' Pilot, Microsoft Translator (which translates not only text but also speech, images, and street signs), Google Translate, and Facebook translation. Translations are occurring faster and with greater accuracy thanks to machine translation. But what does this mean for the traditional translator? As an expatriate in Germany, I am a user of both translation services and translation software, so I was interested to find out more. I spoke with the CEO and founder of Gengo, Matt Romaine.
Gated Hierarchical Attention for Image Captioning
Wang, Qingzhong, Chan, Antoni B.
Attention modules connecting encoders and decoders have been widely applied in the fields of object recognition, image captioning, visual question answering and neural machine translation, and significantly improve performance. In this paper, we propose a bottom-up gated hierarchical attention (GHA) mechanism for image captioning. Our proposed model employs a CNN as the decoder, which is able to learn different concepts at different layers, and different concepts correspond to different areas of an image. Therefore, we develop the GHA, in which low-level concepts are merged into high-level concepts and, simultaneously, low-level attended features pass to the top to make predictions. Our GHA significantly improves the performance of the model that applies only one-level attention; for example, the CIDEr score increases from 0.923 to 0.999, which is comparable to state-of-the-art models that employ attribute boosting and reinforcement learning (RL). We also conduct extensive experiments to analyze the CNN decoder and our proposed GHA, and we find that deeper decoders do not obtain better performance, and that when the convolutional decoder becomes deeper the model is likely to collapse during training.
On Controllable Sparse Alternatives to Softmax
Laha, Anirban, Chemmengath, Saneem A., Agrawal, Priyanka, Khapra, Mitesh M., Sankaranarayanan, Karthik, Ramaswamy, Harish G.
Converting an n-dimensional vector to a probability distribution over n objects is a commonly used component in many machine learning tasks such as multiclass classification, multilabel classification, and attention mechanisms. For this, several probability mapping functions have been proposed and employed in the literature, such as softmax, sum-normalization, spherical softmax, and sparsemax, but there is very little understanding of how they relate to each other. Further, none of the above formulations offers explicit control over the degree of sparsity. To address this, we develop a unified framework that encompasses all these formulations as special cases. This framework ensures simple closed-form solutions and the existence of sub-gradients suitable for learning via backpropagation. Within this framework, we propose two novel sparse formulations, sparsegen-lin and sparsehourglass, that seek to provide control over the degree of desired sparsity. We further develop novel convex loss functions that help induce the behavior of the aforementioned formulations in the multilabel classification setting, showing improved performance. We also demonstrate empirically that the proposed formulations, when used to compute attention weights, achieve better or comparable performance on standard seq2seq tasks such as neural machine translation and abstractive summarization.
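As background for two of the mapping functions named in this abstract, the Python sketch below contrasts softmax, which always yields a dense distribution, with sparsemax, which projects the scores onto the probability simplex and can assign exact zeros. It follows the standard sort-and-threshold definition of sparsemax and does not implement the paper's sparsegen-lin or sparsehourglass. The example scores are arbitrary.

```python
import math
from typing import List, Sequence


def softmax(z: Sequence[float]) -> List[float]:
    """Dense mapping: every output is strictly positive."""
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def sparsemax(z: Sequence[float]) -> List[float]:
    """Euclidean projection of z onto the probability simplex.

    Sort the scores, find the largest support size j for which
    1 + j * z_(j) > sum of the j largest scores, and subtract the
    resulting threshold tau from every score, clipping at zero.
    """
    cumsum = 0.0
    tau = 0.0
    for j, v in enumerate(sorted(z, reverse=True), start=1):
        cumsum += v
        if 1 + j * v > cumsum:
            tau = (cumsum - 1) / j  # threshold for a support of size j
        else:
            break  # the condition cannot hold again for larger j
    return [max(v - tau, 0.0) for v in z]


scores = [1.0, 0.5, -1.0]
print(softmax(scores))    # all three entries are positive
print(sparsemax(scores))  # the lowest-scoring entry is exactly zero
```

Multiplying the scores by a constant before applying `sparsemax` changes how many entries survive (for instance, doubling the scores above leaves a single nonzero entry), which gives some intuition for why an explicit sparsity control is of interest. That rescaling is only an illustration here and is not presented as one of the paper's formulations.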