Neural Machine Translation (NMT) is an end-to-end learning approach for automated translation. Its strength comes from the fact that it learns the mapping directly from input text to the associated output text. It has proven more effective than traditional phrase-based machine translation, which requires much more effort to design the model. On the other hand, NMT models are costly to train, especially on large-scale translation datasets. They are also significantly slower at inference time because of the large number of parameters they use.
In this post, I walk through how to build and train a neural translation model with attention. The model will be used to translate French to English. This post focuses on the conceptual explanation, while a detailed walkthrough of the project code can be found in the associated Jupyter notebook. The notebook can be viewed online or cloned from the project GitHub repository. This project closely follows the PyTorch Sequence to Sequence tutorial, while attempting to go into more depth with both the model implementation and the explanation. We are trying to build a translation model. One model that has been successful in this task is an Encoder-Decoder network.
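To make the role of attention in an Encoder-Decoder network more concrete, here is a minimal sketch in plain Python. It is not the notebook's code: it assumes simple dot-product scoring, uses made-up vectors, and the names `attend`, `encoder_states`, and `decoder_state` are illustrative only. At each decoding step the decoder state is compared with every encoder state, and the resulting weights are used to build a context vector.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state, encoder_states):
    """Return (weights, context) for a single decoding step.

    Scores are plain dot products between the decoder state and each
    encoder state; this scoring function is an assumption of the sketch.
    """
    scores = [
        sum(d * e for d, e in zip(decoder_state, enc))
        for enc in encoder_states
    ]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    # The context vector is the attention-weighted sum of encoder states.
    context = [
        sum(w * enc[i] for w, enc in zip(weights, encoder_states))
        for i in range(dim)
    ]
    return weights, context


# Toy encoder outputs for a three-token source sentence (made-up numbers).
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]

weights, context = attend(decoder_state, encoder_states)
```

In this toy case the first and third encoder states score highest against the decoder state, so they receive most of the weight. A trained model would typically learn the scoring function rather than use raw dot products.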
By Rosaria Silipo.
Automatic machine translation has been a popular subject for machine learning algorithms. After all, if machines can detect topics and understand texts, translation should be just the next step. Machine translation can be seen as a variation of natural language generation. In a previous project, we worked on the automatic generation of fairy tales (see "Once upon a Time … by LSTM Network").
Attention-based models have shown significant improvement over traditional algorithms in several NLP tasks. The Transformer, for instance, is an illustrative example that generates abstract representations of tokens input to an encoder based on their relationships to all tokens in a sequence. Recent studies have shown that although such models are capable of learning syntactic features purely by seeing examples, explicitly feeding this information to deep learning models can significantly enhance their performance. Leveraging syntactic information like part of speech (POS) may be particularly beneficial in limited training data settings for complex models such as the Transformer. We show that the syntax-infused Transformer with multiple features achieves an improvement of 0.7 BLEU when trained on the full WMT '14 English to German translation dataset and a maximum improvement of 1.99 BLEU points when trained on a fraction of the dataset. In addition, we find that the incorporation of syntax into BERT fine-tuning outperforms the baseline on a number of downstream tasks from the GLUE benchmark.

Introduction

Attention-based deep learning models for natural language processing (NLP) have shown promise for a variety of machine translation and natural language understanding tasks. For word-level, sequence-to-sequence tasks such as translation, paraphrasing, and text summarization, attention-based models allow a single token (e.g., a word or subword) in a sequence to be represented as a combination of all tokens in the sequence (Luong, Pham, and Manning, 2015).
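The idea of representing one token as a combination of all tokens in the sequence can be illustrated with a small sketch. This is a simplified, assumed form of scaled dot-product self-attention in plain Python: it omits the learned query, key, and value projections that a Transformer would apply, the function name `self_attention` is hypothetical, and the embeddings are made-up values.

```python
import math


def self_attention(tokens):
    """Scaled dot-product self-attention with identity projections.

    Real Transformers apply learned query/key/value projections first;
    leaving them out is a simplification made for this sketch.
    """
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        # Similarity of this token to every token in the sequence,
        # itself included, scaled by sqrt(dim).
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        # Softmax: positive weights that sum to 1.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # New representation: weighted average of all token vectors.
        outputs.append([
            sum(w * tok[i] for w, tok in zip(weights, tokens))
            for i in range(dim)
        ])
    return outputs


# Three toy 2-dimensional token embeddings (made-up values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed = self_attention(tokens)
```

Each row of `mixed` is a weighted average of all three input vectors, so every output token carries some information from the whole sequence while staying closest to the tokens it is most similar to.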
Neural Machine Translation (NMT) has achieved remarkable progress with the rapid evolution of model structures. In this paper, we propose the concept of layer-wise coordination for NMT, which explicitly coordinates the learning of hidden representations of the encoder and decoder together layer by layer, gradually from low level to high level. Specifically, we design a layer-wise attention and mixed attention mechanism, and further share the parameters of each layer between the encoder and decoder to regularize and coordinate the learning. Experiments show that, combined with the state-of-the-art Transformer model, layer-wise coordination achieves improvements on three IWSLT and two WMT translation tasks. More specifically, our method achieves BLEU scores of 34.43 and 29.01 on the WMT16 English-Romanian and WMT14 English-German tasks, respectively, outperforming the Transformer baseline.