This article is the first installment of a two-post series on Building a machine reading comprehension system using the latest advances in deep learning for NLP. Stay tuned for the second part, where we'll introduce a pre-trained model called BERT that will take your NLP projects to the next level! In the recent past, if you specialized in natural language processing (NLP), there may have been times when you felt a little jealous of your colleagues working in computer vision. It seemed as if they had all the fun: the annual ImageNet classification challenge, Neural Style Transfer, Generative Adversarial Networks, to name a few. At last, the dry spell is over, and the NLP revolution is well underway!
In the previous stories we discussed about Transformers models and their application and did some detailed discussion about the Encoder blocks architecture. In this article we are going to look more on Decoder blocks, the another main building block of the transformers. The architecture of the Decoder is similar to the Encoder model that we discussed previously. It consists of stack of decoders which are identical in structure. The output of encoder will pass it to the decoder as input as sequences and the process will continues until a specific symbol is reached that indicate that the output is completed eg: When we decode the sentence "Welcome to NYC." using decoder the each word will have a numerical representation or feature vectors as output by the decoder model and when the "." symbol passes to the decoder it identifies that the output is completed.
We know that we used logo from Transformers in the featured image, so if you are a toy/movies/cartoon fan, sorry to disappoint you. We won't cover any of those topics in this blog post. However, if you are data science and deep learning fan, you are in the right place. In this article, we explore the interesting architecture of Transformers. They are a special type of sequence-to-sequence models used for language modeling, machine translation, image captioning and text generation.
This year, we saw a dazzling application of machine learning. The OpenAI GPT-2 exhibited impressive ability of writing coherent and passionate essays that exceed what we anticipated current language models are able to produce. The GPT-2 wasn't a particularly novel architecture – it's architecture is very similar to the decoder-only transformer. The GPT2 was, however, a very large, transformer-based language model trained on a massive dataset. In this post, we'll look at the architecture that enabled the model to produce its results. We will go into the depths of its self-attention layer. My goal here is to also supplement my earlier post, The Illustrated Transformer, with more visuals explaining the inner-workings of transformers, and how they've evolved since the original paper. My hope is that this visual language will hopefully make it easier to explain later Transformer-based models as their inner-workings continue to evolve.
This is arguably the most important architecture for natural language processing (NLP) today. Specifically, we look at modeling frameworks such as the generative pretrained transformer (GPT), bidirectional encoder representations from transformers (BERT) and multilingual BERT (mBERT). These methods employ neural networks with more parameters than most deep convolutional and recurrent neural network models. Despite the larger size, they've exploded in popularity because they scale comparatively more effectively on parallel computing architecture. This enables even larger and more sophisticated models to be developed in practice. Until the arrival of the transformer, the dominant NLP models relied on recurrent and convolutional components. Additionally, the best sequence modeling and transduction problems, such as machine translation, rely on an encoder-decoder architecture with an attention mechanism to detect which parts of the input influence each part of the output. The transformer aims to replace the recurrent and convolutional components entirely with attention.