The number of machine learning models grows day by day, and at that pace it is hard for beginners to choose an effective model for Natural Language Understanding (NLU) and Natural Language Generation (NLG) tasks. Researchers across the globe are working around the clock to build agile and intuitive sequence-to-sequence learning models. Transformers are one such model, and in recent times they have gained prominence in machine learning for tasks such as speech-to-text. The wide availability of other sequence-to-sequence learning models, such as RNNs, LSTMs, and GRUs, often leaves beginners unsure of where transformers fit in.
The famous 2017 paper "Attention Is All You Need" changed the way we think about attention. Nonetheless, 2020 was definitely the year of transformers! From natural language, they have now moved into computer vision tasks. How did we go from attention to self-attention? Why does the transformer work so damn well? What are the critical components for its success? Read on and find out! In my opinion, transformers are not so hard to grasp.
Transformer models have become the de facto standard for NLP tasks. As an example, I'm sure you've already seen the awesome GPT-3 transformer demos and the articles detailing how much time and money it took to train. But even outside of NLP, you can find transformers in computer vision and music generation. That said, for such a useful model, transformers are still very difficult to understand. It took me multiple readings of the Google research paper that first introduced transformers, plus a host of blog posts, to really understand how they work. I'll try to keep the jargon and the technicality to a minimum, but do keep in mind that this topic is complicated. I'll also include some basic math and try to keep things light so the long journey stays fun.
Q: Why should I understand Transformers?
In the past, the state-of-the-art approach to language modeling problems (put simply, predicting the next word) and translation systems was the LSTM or GRU architecture (explained here) combined with the attention mechanism.
Deep learning has kept evolving over the years, and that is an important reason for its reputation. Deep learning relies heavily on large numbers of parameters to extract useful information from the dataset at hand. With more parameters, a model can capture finer distinctions in the data, which makes it easier to classify or detect something. One notable milestone in the journey of deep learning so far, specifically in Natural Language Processing, was the introduction of language models, which greatly improved the accuracy and efficiency of various NLP tasks. A sequence-to-sequence model is an encoder-decoder model that takes a sequence of inputs and returns a sequence of outputs.
Traditionally, recurrent neural networks (RNNs) and their variants have been used extensively for Natural Language Processing problems. In recent years, transformers have outperformed most RNN models. Before looking at transformers, let's revisit recurrent neural networks, how they work, and where they fall short. There are different types of recurrent neural networks. For natural language processing, RNNs are typically arranged in an encoder-decoder architecture. The encoder summarizes all the information from the input sentence, and the decoder uses the encoder's output to generate the output sequence.
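To make the encoder-decoder idea concrete, here is a minimal sketch of a vanilla RNN encoder-decoder in plain Python. It is only an illustration, not a reference implementation: the weights are random and untrained, and the vocabulary size, hidden size, token ids, and all function names (`init_rnn`, `rnn_step`, `encode`, `decode`) are hypothetical choices made for this example.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is repeatable

VOCAB_SIZE = 10   # hypothetical toy vocabulary size
HIDDEN_SIZE = 8   # hypothetical hidden-state size


def random_matrix(rows, cols):
    """Small random weights; a real model would learn these by training."""
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def init_rnn():
    return {
        "W_xh": random_matrix(HIDDEN_SIZE, VOCAB_SIZE),   # input -> hidden
        "W_hh": random_matrix(HIDDEN_SIZE, HIDDEN_SIZE),  # hidden -> hidden
        "W_hy": random_matrix(VOCAB_SIZE, HIDDEN_SIZE),   # hidden -> output scores
    }


def rnn_step(params, token_id, hidden):
    """One recurrent step: new hidden state from current token and old state."""
    # Multiplying W_xh by a one-hot vector just selects column `token_id`.
    from_input = [row[token_id] for row in params["W_xh"]]
    from_hidden = matvec(params["W_hh"], hidden)
    return [math.tanh(a + b) for a, b in zip(from_input, from_hidden)]


def encode(params, input_ids):
    """Read the whole input; the final hidden state is the summary (context)."""
    hidden = [0.0] * HIDDEN_SIZE
    for token_id in input_ids:
        hidden = rnn_step(params, token_id, hidden)
    return hidden


def decode(params, context, start_id, max_len):
    """Generate tokens one at a time, starting from the encoder's context."""
    hidden = context
    token_id = start_id
    output_ids = []
    for _ in range(max_len):
        hidden = rnn_step(params, token_id, hidden)
        scores = matvec(params["W_hy"], hidden)
        token_id = max(range(VOCAB_SIZE), key=scores.__getitem__)  # greedy pick
        output_ids.append(token_id)
    return output_ids


encoder = init_rnn()  # its W_hy is unused; the encoder only produces a state
decoder = init_rnn()

context = encode(encoder, [1, 4, 2, 7])
output_ids = decode(decoder, context, start_id=0, max_len=5)
print(len(context), output_ids)
```

Notice that the decoder sees the input only through `context`, a single fixed-size vector, no matter how long the input sentence is. That bottleneck is one of the places where plain RNN encoder-decoders fall short.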