Last year in August Google released the Transformer, a novel neural network architecture based on a self-attention mechanism particularly well suited for language understanding. Before the Transformer, most neural network based approaches to machine translation relied on recurrent neural networks (RNNs) which operated sequentially using recurrence. In contrast to RNN-based approaches, the Transformer used no recurrence, instead it processed all words or symbols in the sequence and let each word attend the other word over multiple processing steps using a self-attention mechanism to incorporate context from words farther away. This approach led Transformer to train the recurrent models much faster and yield better translation results than RNNs. "However, on smaller and more structured language understanding tasks, or even simple algorithmic tasks such as copying a string (e.g. to transform an input of "abc" to "abcabc"), the Transformer does not perform very well.", says Stephan Gouws and Mostafa Dehghani from the Google Brain team.
In this way the network architecture is able to respond to an utterance with an response. Last year this concept was generalized to including a dialog encoder layer on top of the standard encoder. This might further enhance the architecture to keep track of previous utterances in a full dialog. The Sequence-To-Sequence architectures as every machine learning system has to undergo a certain training process. Here, the encoder and the decoder are trained together by presenting corresponding sequence pairs to them.
We present a recurrent encoder-decoder deep neural network architecture that directly translates speech in one language into text in another. The model does not explicitly transcribe the speech into text in the source language, nor does it require supervision from the ground truth source language transcription during training. We apply a slightly modified sequence-to-sequence with attention architecture that has previously been used for speech recognition and show that it can be repurposed for this more complex task, illustrating the power of attention-based models. A single model trained end-to-end obtains state-of-the-art performance on the Fisher Callhome Spanish-English speech translation task, outperforming a cascade of independently trained sequence-to-sequence speech recognition and machine translation models by 1.8 BLEU points on the Fisher test set. In addition, we find that making use of the training data in both languages by multi-task training sequence-to-sequence speech translation and recognition models with a shared encoder network can improve performance by a further 1.4 BLEU points.
Machine translation data can be the potential equivalent to ImageNet in NLP. The authors try to prove this hypothesis by adopting the attentional sequence-to-sequence model trained for machine translation. These models usually contain an LSTM-based encoder. They make use of this encoder to obtain, what they call, the contextual vectors. This process involves two steps.
The dominant sequence transduction models are based on complex recurrent orconvolutional neural networks in an encoder and decoder configuration. The best performing such models also connect the encoder and decoder through an attentionm echanisms. We propose a novel, simple network architecture based solely onan attention mechanism, dispensing with recurrence and convolutions entirely.Experiments on two machine translation tasks show these models to be superiorin quality while being more parallelizable and requiring significantly less timeto train. Our single model with 165 million parameters, achieves 27.5 BLEU onEnglish-to-German translation, improving over the existing best ensemble result by over 1 BLEU. On English-to-French translation, we outperform the previoussingle state-of-the-art with model by 0.7 BLEU, achieving a BLEU score of 41.1.