We have published 7k books, videos, articles, and tutorials. If you've been following developments in deep learning and natural language processing (NLP) over the past few years then you've probably heard of something called BERT; and if you haven't, just know that techniques owing something to BERT will likely play an increasing part in all our digital lives. BERT is a state-of-the-art embedding model published by Google, and it represents a breakthrough in the field of NLP by providing excellent results on many NLP tasks, including question answering, text generation, sentence classification, and more. Here we are going to look at what BERT is and see what is distinctive about it, by looking in a relatively high-level way (eschewing the underlying linear algebra) at the internal workings of the BERT model. By the end you should have if not a detailed understanding then at least a strong sense of what underpins this modern approach to NLP and other methods like it.
Transformers have become the defacto standard for NLP tasks nowadays. While the Transformer architecture was introduced with NLP, they are now being used in Computer Vision and to generate music as well. I am sure you would all have heard about the GPT3 Transformer and its applications thereof. But all these things aside, they are still hard to understand as ever. It has taken me multiple readings through the Google research paper that first introduced transformers along with just so many blog posts to really understand how a transformer works. So, I thought of putting the whole idea down in as simple words as possible and with some very basic Math and some puns as I am a proponent of having some fun while learning. I will try to keep both the jargon and the technicality to a minimum, yet it is such a topic that I could only do so much. And my goal is to make the reader understand even the most gory details of Transformer by the end of this post. Also, this is officially my longest post both in terms of time taken to write it as well as length of the post. So, here goes -- This post will be a highly conversational one and it is about "Decoding The Transformer".
Transformer models have become the defacto standard for NLP tasks. As an example, I'm sure you've already seen the awesome GPT3 Transformer demos and articles detailing how much time and money it took to train. But even outside of NLP, you can also find transformers in the fields of computer vision and music generation. That said, for such a useful model, transformers are still very difficult to understand. It took me multiple readings of the Google research paper first introducing transformers, and a host of blog posts to really understand how transformers work. I'll try to keep the jargon and the technicality to a minimum, but do keep in mind that this topic is complicated. I'll also include some basic math and try to keep things light to ensure the long journey is fun. Q: Why should I understand Transformers? In the past, the state of the art approach to language modeling problems (put simply, predicting the next word) and translations systems was the LSTM and GRU architecture (explained here) along with the attention mechanism.
This article is the first installment of a two-post series on Building a machine reading comprehension system using the latest advances in deep learning for NLP. Stay tuned for the second part, where we'll introduce a pre-trained model called BERT that will take your NLP projects to the next level! In the recent past, if you specialized in natural language processing (NLP), there may have been times when you felt a little jealous of your colleagues working in computer vision. It seemed as if they had all the fun: the annual ImageNet classification challenge, Neural Style Transfer, Generative Adversarial Networks, to name a few. At last, the dry spell is over, and the NLP revolution is well underway!
Deep learning has kept evolving throughout the years. And that is an important reason for its reputation. Deep learning practices highly emphasize the use of large buckets of parameters to extract useful information about the dataset we're dealing with. By having a large set of parameters, it becomes easier to classify/detect something as we have more data to identify distinctly. One notable milestone in the journey of Deep Learning so far, and specifically in Natural Language Processing, was the introduction of Language Models that highly improved the accuracy and efficiency of doing various NLP tasks. A sequence-sequence model is an encoder-decoder mechanism-based model that takes a sequence of inputs and returns a sequence of outputs as result.