Machine Translation
5 Innovative Applications of Automated Machine Learning
Machine Learning is a popular term in the technology world right now, and it represents a significant step forward in how computers can learn. Demand for Machine Learning Engineers is high, driven by evolving technology and the generation of huge volumes of data known as Big Data. Automated Machine Learning consolidates best practices from top-ranked data scientists to make Data Science more accessible across the organization. It also enables business users to implement AI solutions easily, which lets an organization's data scientists concentrate on more complex problems. As we move further into the digital era, Machine Learning is one of the cutting-edge developments we have seen.
Lookahead Optimizer: k steps forward, 1 step back
Zhang, Michael, Lucas, James, Ba, Jimmy, Hinton, Geoffrey E.
The vast majority of successful deep neural networks are trained using variants of stochastic gradient descent (SGD) algorithms. Recent attempts to improve SGD can be broadly categorized into two approaches: (1) adaptive learning rate schemes, such as AdaGrad and Adam, and (2) accelerated schemes, such as heavy-ball and Nesterov momentum. In this paper, we propose a new optimization algorithm, Lookahead, that is orthogonal to these previous approaches and iteratively updates two sets of weights. Intuitively, the algorithm chooses a search direction by looking ahead at the sequence of "fast weights" generated by another optimizer. We show that Lookahead improves the learning stability and lowers the variance of its inner optimizer with negligible computation and memory cost.
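The two sets of weights can be illustrated with a minimal sketch. It assumes plain SGD as the inner optimizer and assumes the slow weights move by linear interpolation toward the last fast weights after every k inner steps. The function names, the toy quadratic objective, and the hyperparameter values (`k`, `alpha`, `lr`) are illustrative choices, not settings taken from the abstract.

```python
def lookahead_sgd(grad, init_weights, k=5, alpha=0.5, lr=0.1, outer_steps=50):
    """Toy Lookahead loop with plain SGD as the inner optimizer.

    grad: function mapping a list of weights to a list of gradients.
    k, alpha, lr, outer_steps: illustrative hyperparameters (assumptions).
    """
    slow = list(init_weights)
    for _ in range(outer_steps):
        # Fast weights start each round at the current slow weights.
        fast = list(slow)
        # Look ahead: let the inner optimizer take k steps.
        for _ in range(k):
            g = grad(fast)
            fast = [w - lr * gi for w, gi in zip(fast, g)]
        # Step back: move the slow weights part of the way toward the fast weights.
        slow = [s + alpha * (f - s) for s, f in zip(slow, fast)]
    return slow


def quadratic_grad(w):
    # Gradient of the toy objective f(w) = 0.5 * (1.0 * w[0]**2 + 10.0 * w[1]**2).
    return [1.0 * w[0], 10.0 * w[1]]


result = lookahead_sgd(quadratic_grad, [3.0, -2.0])
print(result)  # both coordinates should end up close to the minimum at 0
```

In this sketch the only extra state beyond the inner optimizer is one additional copy of the weights, and the interpolation runs once every k steps.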
Neural Machine Translation with Soft Prototype
Wang, Yiren, Xia, Yingce, Tian, Fei, Gao, Fei, Qin, Tao, Zhai, Cheng Xiang, Liu, Tie-Yan
Neural machine translation models usually use the encoder-decoder framework and generate translation from left to right (or right to left) without fully utilizing the target-side global information. A few recent approaches seek to exploit the global information through two-pass decoding, yet have limitations in translation quality and model efficiency. In this work, we propose a new framework that introduces a soft prototype into the encoder-decoder architecture, which allows the decoder to have indirect access to both past and future information, such that each target word can be generated based on a better global understanding. We further provide an efficient and effective method to generate the prototype. Empirical studies on various neural machine translation tasks show that our approach brings significant improvement in generation quality over the baseline model, with little extra cost in storage and inference time, demonstrating the effectiveness of our proposed framework.
Fast Structured Decoding for Sequence Models
Sun, Zhiqing, Li, Zhuohan, Wang, Haoqing, He, Di, Lin, Zi, Deng, Zhihong
Autoregressive sequence models achieve state-of-the-art performance in domains like machine translation. However, due to the autoregressive factorization nature, these models suffer from heavy latency during inference. Recently, non-autoregressive sequence models were proposed to speed up the inference time. However, these models assume that the decoding process of each token is conditionally independent of others. Such a generation process sometimes makes the output sentence inconsistent, and thus the learned non-autoregressive models could only achieve inferior accuracy compared to their autoregressive counterparts.
A Tensorized Transformer for Language Modeling
Ma, Xindian, Zhang, Peng, Zhang, Shuai, Duan, Nan, Hou, Yuexian, Zhou, Ming, Song, Dawei
Recent developments in neural models have connected the encoder and decoder through a self-attention mechanism. In particular, Transformer, which is solely based on self-attention, has led to breakthroughs in Natural Language Processing (NLP) tasks. However, the multi-head attention mechanism, as a key component of Transformer, limits the effective deployment of the model in resource-limited settings. In this paper, based on the ideas of tensor decomposition and parameter sharing, we propose a novel self-attention model (namely Multi-linear attention) with Block-Term Tensor Decomposition (BTD). We test and verify the proposed attention method on three language modeling tasks (i.e., PTB, WikiText-103 and One-billion) and a neural machine translation task (i.e., WMT-2016 English-German). Multi-linear attention can not only largely compress the model parameters but also obtain performance improvements, compared with a number of language modeling approaches, such as Transformer, Transformer-XL, and Transformer with tensor train decomposition.
Leveraging Foreign Language Labeled Data for Aspect-Based Opinion Mining
Thuy, Nguyen Thi Thanh, Bach, Ngo Xuan, Phuong, Tu Minh
Aspect-based opinion mining is the task of identifying sentiment at the aspect level in opinionated text, which consists of two subtasks: aspect category extraction and sentiment polarity classification. While aspect category extraction aims to detect and categorize opinion targets such as product features, sentiment polarity classification assigns a sentiment label, i.e. positive, negative, or neutral, to each identified aspect. Supervised learning methods have been shown to deliver better accuracy for this task but they require labeled data, which is costly to obtain, especially for resource-poor languages like Vietnamese. To address this problem, we present a supervised aspect-based opinion mining method that utilizes labeled data from a foreign language (English in this case), which is translated to Vietnamese by an automated translation tool (Google Translate). Because aspects and opinions in different languages may be expressed by different words, we propose using word embeddings, in addition to other features, to reduce the vocabulary difference between the original and translated texts, thus improving the effectiveness of aspect category extraction and sentiment polarity classification processes. We also introduce an annotated corpus of aspect categories and sentiment polarities extracted from restaurant reviews in Vietnamese, and conduct a series of experiments on the corpus. Experimental results demonstrate the effectiveness of the proposed approach.
ASR Error Correction and Domain Adaptation Using Machine Translation
Mani, Anirudh, Palaskar, Shruti, Meripo, Nimshi Venkat, Konam, Sandeep, Metze, Florian
Off-the-shelf pre-trained Automatic Speech Recognition (ASR) systems are an increasingly viable service for companies of any size building speech-based products. While these ASR systems are trained on large amounts of data, domain mismatch is still an issue for many such parties that want to use this service as-is, leading to suboptimal results for their task. We propose a simple technique to perform domain adaptation for ASR error correction via machine translation. The machine translation model is a strong candidate to learn a mapping from out-of-domain ASR errors to in-domain terms in the corresponding reference files. We use two off-the-shelf ASR systems in this work: Google ASR (commercial) and the ASPIRE model (open-source). We observe a 7% absolute improvement in word error rate and a 4-point absolute improvement in BLEU score on Google ASR output via our proposed method. We also evaluate ASR error correction via a downstream task of Speaker Diarization that captures speaker style, syntax, structure and semantic improvements we obtain via ASR correction.
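For readers unfamiliar with the metric, word error rate is commonly computed as the word-level edit distance between the reference and the hypothesis, divided by the number of reference words. The sketch below follows that common definition; the function name and the example sentences are made up for illustration and are not taken from the paper's data or evaluation code.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substitution ("sat" -> "sit") and one deletion ("the").
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER = {wer:.3f}")
```

An absolute improvement is the plain difference between two such rates, as opposed to a relative reduction expressed as a fraction of the baseline.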
Moral Machines: Translation Suppliers and AI Ethics
That said, service suppliers might still be concerned about potential bias within the datasets used to build automated translation solutions. The response could be that (post)editing is basically tasked with removing any traces of unwanted "bias" generated by an unthinking machine. It would, of course, be interesting to know whether we can teach the technology to automatically isolate potential bias (in the "social" sense) from semantic error in the industry sense of mistranslation. Or more subtly, could translating something accurately unwittingly induce a sentiment of bias for a given native speaker? Going forward, the pursuit of translation accuracy may require social inclusiveness in certain cases to address the emerging norms of new language user communities.
Learning to Encode Position for Transformer with Continuous Dynamical Model
Liu, Xuanqing, Yu, Hsiang-Fu, Dhillon, Inderjit, Hsieh, Cho-Jui
We introduce a new way of learning to encode position information for non-recurrent models, such as Transformer models. Unlike RNN and LSTM, which contain inductive bias by loading the input tokens sequentially, non-recurrent models are less sensitive to position. The main reason is that position information among input units is not inherently encoded, i.e., the models are permutation equivalent; this problem justifies why all of the existing models are accompanied by a sinusoidal encoding/embedding layer at the input. However, this solution has clear limitations: the sinusoidal encoding is not flexible enough as it is manually designed and does not contain any learnable parameters, whereas the position embedding restricts the maximum length of input sequences. It is thus desirable to design a new position layer that contains learnable parameters to adjust to different datasets and different architectures. At the same time, we would also like the encodings to extrapolate in accordance with the variable length of inputs. In our proposed solution, we borrow from the recent Neural ODE approach, which may be viewed as a versatile continuous version of a ResNet. This model is capable of modeling many kinds of dynamical systems. We model the evolution of encoded results along position index by such a dynamical system, thereby overcoming the above limitations of existing methods. We evaluate our new position layers on a variety of neural machine translation and language understanding tasks; the experimental results show consistent improvements over the baselines.
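As background for the baseline the abstract criticizes, here is a minimal sketch of the fixed sinusoidal position encoding in its widely used Transformer form (sine on even dimensions, cosine on odd dimensions, with a base of 10000). It illustrates the hand-designed encoding only, not the ODE-based layer proposed in the paper; the function name is hypothetical.

```python
import math


def sinusoidal_encoding(num_positions, d_model):
    """Return a num_positions x d_model table of fixed position encodings."""
    if d_model % 2 != 0:
        raise ValueError("d_model must be even in this sketch")
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(0, d_model, 2):
            # Each pair of dimensions (i, i + 1) shares one wavelength.
            angle = pos / (10000 ** (i / d_model))
            row.append(math.sin(angle))  # even dimension
            row.append(math.cos(angle))  # odd dimension
        table.append(row)
    return table


encodings = sinusoidal_encoding(num_positions=4, d_model=4)
for pos, row in enumerate(encodings):
    print(pos, [round(value, 4) for value in row])
```

Nothing in this table is learned: every value is determined by the position index and the dimension alone, which is the inflexibility the abstract points to.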
"An Image is Worth a Thousand Features": Scalable Product Representations for In-Session Type-Ahead Personalization
Yu, Bingqing, Tagliabue, Jacopo, Greco, Ciro, Bianchi, Federico
We address the problem of personalizing query completion in a digital commerce setting, in which the bounce rate is typically high and recurring users are rare. We focus on in-session personalization and improve a standard noisy channel model by injecting dense vectors computed from product images at query time. We argue that image-based personalization displays several advantages over alternative proposals (from data availability to business scalability), and provide quantitative evidence and qualitative support on the effectiveness of the proposed methods. Finally, we show how a shared vector space between similar shops can be used to improve the experience of users browsing across sites, opening up the possibility of applying zero-shot unsupervised personalization to increase conversions. This will prove to be particularly relevant to retail groups that manage multiple brands and/or websites and to multi-tenant SaaS providers that serve multiple clients in the same space.