Q: That's alright but, how does an encoder stack encode an English sentence exactly? Patience, I am getting to it. So, as I said the encoder stack contains six encoder layers on top of each other(As given in the paper, but the future versions of transformers use even more layers). Don't lose me yet as I will explain both of them in the coming sections. Right now, just remember that the encoder layer incorporates attention and a position-wise feed-forward network.
In this part, we will try to understand the Encoder-Decoder architecture of the Multi-Head Self-Attention Transformer network with some code in PyTorch. There won't be any theory involved(better theoretical version can be found here) just the barebones of the network and how can one write this network on its own in PyTorch. The architecture comprising the Transformer model is divided into two parts -- the Encoder part and the Decoder part. Several other things combine to form the Encoder and Decoder parts. Let's start with the Encoder.
This article is the first installment of a two-post series on Building a machine reading comprehension system using the latest advances in deep learning for NLP. Stay tuned for the second part, where we'll introduce a pre-trained model called BERT that will take your NLP projects to the next level! In the recent past, if you specialized in natural language processing (NLP), there may have been times when you felt a little jealous of your colleagues working in computer vision. It seemed as if they had all the fun: the annual ImageNet classification challenge, Neural Style Transfer, Generative Adversarial Networks, to name a few. At last, the dry spell is over, and the NLP revolution is well underway!
In the previous stories we discussed about Transformers models and their application and did some detailed discussion about the Encoder blocks architecture. In this article we are going to look more on Decoder blocks, the another main building block of the transformers. The architecture of the Decoder is similar to the Encoder model that we discussed previously. It consists of stack of decoders which are identical in structure. The output of encoder will pass it to the decoder as input as sequences and the process will continues until a specific symbol is reached that indicate that the output is completed eg: When we decode the sentence "Welcome to NYC." using decoder the each word will have a numerical representation or feature vectors as output by the decoder model and when the "." symbol passes to the decoder it identifies that the output is completed.
Encoders in sequence-to-sequence architecture are meant to give an output of word embeddings providing the context of the word. And the decoding component converts these embeddings to the human observable sequences. The transformer contains an encoder and decoder for Machine Neural Translation. BERT takes only the encoder part from Transformers and replicates the encoders to form a stack. Basically, BERT is a stack of encoders.