This article is the first installment of a two-post series on building a machine reading comprehension system using the latest advances in deep learning for NLP. Stay tuned for the second part, where we'll introduce a pre-trained model called BERT that will take your NLP projects to the next level! In the recent past, if you specialized in natural language processing (NLP), there may have been times when you felt a little jealous of your colleagues working in computer vision. It seemed as if they had all the fun: the annual ImageNet classification challenge, Neural Style Transfer, and Generative Adversarial Networks, to name a few. At last, the dry spell is over, and the NLP revolution is well underway!
In the previous stories we discussed Transformer models and their applications, and looked at the architecture of the Encoder blocks in detail. In this article we take a closer look at Decoder blocks, the other main building block of the Transformer. The architecture of the Decoder is similar to that of the Encoder we discussed previously: it consists of a stack of decoders that are identical in structure. The output of the encoder is passed to the decoder as an input sequence, and decoding continues until a specific symbol is reached that indicates the output is complete. For example, when we decode the sentence "Welcome to NYC.", each word gets a numerical representation, or feature vector, as output by the decoder, and when the "." symbol passes through the decoder, it identifies that the output is complete.
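The stopping behaviour described above can be sketched in a few lines of plain Python. This is only a toy illustration, not a real Transformer decoder: `greedy_decode`, `toy_next_token`, `TARGET`, and `max_steps` are hypothetical names introduced here, and the "model" simply replays the example sentence. In practice, models usually rely on a dedicated end-of-sequence token rather than a literal ".", but the loop has the same shape.

```python
from typing import Callable, List

# Assumption: "." plays the role of the end symbol, as in the example sentence.
END_SYMBOL = "."


def greedy_decode(next_token: Callable[[List[str]], str],
                  max_steps: int = 10) -> List[str]:
    """Generate one token at a time until END_SYMBOL appears or max_steps is hit."""
    output: List[str] = []
    for _ in range(max_steps):
        # The decoder sees everything generated so far and proposes the next token.
        token = next_token(output)
        output.append(token)
        if token == END_SYMBOL:
            break  # the end symbol signals that the output is complete
    return output


# Toy stand-in for a trained decoder: it replays a fixed sentence.
TARGET = ["Welcome", "to", "NYC", "."]


def toy_next_token(generated: List[str]) -> str:
    if len(generated) < len(TARGET):
        return TARGET[len(generated)]
    return END_SYMBOL


result = greedy_decode(toy_next_token)
print(result)
```

The `max_steps` cap is a safety net for the case where the end symbol never appears; without it, a decoder that never emits the symbol would loop forever.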
Transformer models have become the de facto standard for NLP tasks. As an example, I'm sure you've already seen the awesome GPT-3 Transformer demos and articles detailing how much time and money it took to train. But even outside of NLP, you can also find transformers in the fields of computer vision and music generation. That said, for such a useful model, transformers are still very difficult to understand. It took me multiple readings of the Google research paper that first introduced transformers, and a host of blog posts, to really understand how transformers work. I'll try to keep the jargon and the technicality to a minimum, but do keep in mind that this topic is complicated. I'll also include some basic math and try to keep things light to ensure the long journey is fun. Q: Why should I understand Transformers? In the past, the state-of-the-art approach to language modeling problems (put simply, predicting the next word) and translation systems was the LSTM and GRU architectures (explained here) combined with the attention mechanism.
Transformers have become the de facto standard for NLP tasks nowadays. Not only that, but they are now also being used in computer vision and to generate music. I am sure you have all heard about the GPT-3 Transformer and its applications. But all these things aside, they are as hard to understand as ever. It has taken me multiple readings of the Google research paper that first introduced transformers, along with a great many blog posts, to really understand how a transformer works. So I thought of putting the whole idea down in words as simple as possible, along with some very basic math and some puns, as I am a proponent of having some fun while learning. I will try to keep both the jargon and the technicality to a minimum, yet with a topic like this there is only so much I can do. My goal is to make the reader understand even the goriest details of the Transformer by the end of this post. Also, this is officially my longest post, both in terms of the time taken to write it and its length. Hence, I advise you to grab a coffee.
In this part, we will try to understand the Encoder-Decoder architecture of the Multi-Head Self-Attention Transformer network with some code in PyTorch. There won't be any theory involved (a better theoretical version can be found here), just the bare bones of the network and how one can write this network on one's own in PyTorch. The architecture of the Transformer model is divided into two parts: the Encoder part and the Decoder part. Each of these parts is itself built from several smaller components. Let's start with the Encoder.