Collaborating Authors

Deep Transfer Learning for NLP with Transformers


This is arguably the most important architecture for natural language processing (NLP) today. Specifically, we look at modeling frameworks such as the generative pretrained transformer (GPT), bidirectional encoder representations from transformers (BERT) and multilingual BERT (mBERT). These methods employ neural networks with more parameters than most deep convolutional and recurrent neural network models. Despite the larger size, they've exploded in popularity because they scale comparatively more effectively on parallel computing architecture. This enables even larger and more sophisticated models to be developed in practice. Until the arrival of the transformer, the dominant NLP models relied on recurrent and convolutional components. Additionally, the best sequence modeling and transduction problems, such as machine translation, rely on an encoder-decoder architecture with an attention mechanism to detect which parts of the input influence each part of the output. The transformer aims to replace the recurrent and convolutional components entirely with attention.



Transformer models have become the go-to model in most of the NLP tasks. Many transformer-based models like BERT, ROBERTa, GPT series, etc are considered as the state-of-the-art models in NLP. While NLP is overwhelming with all these models, Transformers are gaining popularity in Computer vision also. Transformers are now used for recognizing and constructing images, image encoding, and many more. While transformer models are taking over the AI field, it is also important to have a low-level understanding of these models.

How Transformers work in deep learning and NLP: an intuitive introduction


The famous paper "Attention is all you need" in 2017 changed the way we were thinking about attention. Nonetheless, 2020 was definitely the year of transformers! From natural language now they are into computer vision tasks. How did we go from attention to self-attention? Why does the transformer work so damn well? What are the critical components for its success? Read on and find out! In my opinion, transformers are not so hard to grasp.

Essential Guide to Transformer Models in Machine Learning


Transformer models have become the defacto standard for NLP tasks. As an example, I'm sure you've already seen the awesome GPT3 Transformer demos and articles detailing how much time and money it took to train. But even outside of NLP, you can also find transformers in the fields of computer vision and music generation. That said, for such a useful model, transformers are still very difficult to understand. It took me multiple readings of the Google research paper first introducing transformers, and a host of blog posts to really understand how transformers work. I'll try to keep the jargon and the technicality to a minimum, but do keep in mind that this topic is complicated. I'll also include some basic math and try to keep things light to ensure the long journey is fun. Q: Why should I understand Transformers? In the past, the state of the art approach to language modeling problems (put simply, predicting the next word) and translations systems was the LSTM and GRU architecture (explained here) along with the attention mechanism.

The Annotated Transformer


The Transformer from "Attention is All You Need" has been on a lot of people's minds over the last year. Besides producing major improvements in translation quality, it provides a new architecture for many other NLP tasks. The paper itself is very clearly written, but the conventional wisdom has been that it is quite difficult to implement correctly. In this post I present an "annotated" version of the paper in the form of a line-by-line implementation. I have reordered and deleted some sections from the original paper and added comments throughout. This document itself is a working notebook, and should be a completely usable implementation.