Understanding Transformers, the Data Science Way - KDnuggets

#artificialintelligence

Transformers have become the de facto standard for NLP tasks nowadays. Not only that, but they are now also being used in Computer Vision and to generate music. I am sure you have all heard about the GPT-3 Transformer and its applications. But, all that aside, they are still as hard to understand as ever. It has taken me multiple readings of the Google research paper that first introduced transformers, along with many blog posts, to really understand how a transformer works. So, I thought of putting the whole idea down in as simple words as possible, along with some very basic math and some puns, as I am a proponent of having some fun while learning. I will try to keep both the jargon and the technicality to a minimum, yet it is such a topic that I could only do so much. My goal is for the reader to understand even the goriest details of the Transformer by the end of this post. Also, this is officially my longest post, both in the time it took to write and in its length. Hence, I advise you to grab a coffee.
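
As a concrete anchor for the "very basic Math" that post builds on: the core operation of a Transformer is scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. Below is a minimal sketch of that formula in plain PyTorch; the tensor shapes are illustrative assumptions, not taken from the post.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # pairwise similarity scores
    return F.softmax(scores, dim=-1) @ v            # attention-weighted sum of values

q = k = v = torch.randn(10, 64)                     # 10 tokens, 64-dim embeddings
print(scaled_dot_product_attention(q, k, v).shape)  # torch.Size([10, 64])
```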


Drawing the Transformer Network from Scratch (Part 1)

#artificialintelligence

The Transformer neural networks -- usually just called "Transformers" -- were introduced by a Google-led team in 2017 in a paper titled "Attention Is All You Need". They were refined and popularized by many researchers in subsequent work. Like many models invented before it, the Transformer has an encoder-decoder architecture. In this post, we focus on the encoder and successively draw all of its parts in a bottom-up fashion.
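
To make the encoder structure described above concrete, here is a minimal sketch of a single encoder layer (self-attention followed by a feed-forward sub-layer, each with a residual connection and layer normalization). It assumes PyTorch and illustrative dimensions; it is not the code from the post.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One Transformer encoder layer: self-attention then feed-forward,
    each wrapped in a residual connection and layer normalization."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                      # x: (batch, seq, d_model)
        attn_out, _ = self.attn(x, x, x)       # self-attention over the sequence
        x = self.norm1(x + attn_out)           # residual + layer norm
        return self.norm2(x + self.ff(x))      # feed-forward + residual + layer norm

x = torch.randn(2, 10, 512)
print(EncoderLayer()(x).shape)                 # torch.Size([2, 10, 512])
```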


Augmenting Self-attention with Persistent Memory

arXiv.org Machine Learning

Transformer networks have led to important progress in language modeling and machine translation. These models include two consecutive modules, a feed-forward layer and a self-attention layer. The latter allows the network to capture long-term dependencies and is often regarded as the key ingredient in the success of Transformers. Building upon this intuition, we propose a new model that consists solely of attention layers. More precisely, we augment the self-attention layers with persistent memory vectors that play a similar role to the feed-forward layer. Thanks to these vectors, we can remove the feed-forward layer without degrading the performance of a transformer. Our evaluation shows the benefits brought by our model on standard character- and word-level language modeling benchmarks.
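
A minimal sketch of the idea described in the abstract: learned persistent key/value vectors, shared across all inputs, are concatenated to the keys and values computed from the input so that the attention layer can take over the role of the feed-forward sub-layer. The single-head formulation, names, and sizes below are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PersistentMemoryAttention(nn.Module):
    """Single-head self-attention augmented with learned persistent key/value vectors."""
    def __init__(self, d_model=64, n_persistent=16):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        # Persistent vectors: trained parameters, independent of the input.
        self.pk = nn.Parameter(torch.randn(n_persistent, d_model))
        self.pv = nn.Parameter(torch.randn(n_persistent, d_model))
        self.scale = d_model ** -0.5

    def forward(self, x):                                             # x: (batch, seq, d_model)
        b = x.size(0)
        q = self.q(x)
        k = torch.cat([self.k(x), self.pk.expand(b, -1, -1)], dim=1)  # append persistent keys
        v = torch.cat([self.v(x), self.pv.expand(b, -1, -1)], dim=1)  # append persistent values
        attn = F.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)
        return attn @ v

x = torch.randn(2, 10, 64)
print(PersistentMemoryAttention()(x).shape)                           # torch.Size([2, 10, 64])
```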


$O(n)$ Connections are Expressive Enough: Universal Approximability of Sparse Transformers

arXiv.org Machine Learning

Transformer networks use pairwise attention to compute contextual embeddings of inputs, and have redefined the state of the art in many NLP tasks. However, these models suffer from quadratic computational cost in the input sequence length $n$ to compute attention in each layer. This has prompted recent research into faster attention models, with a predominant approach involving sparsifying the connections in the attention layers. While empirically promising for long sequences, fundamental questions remain unanswered: Can sparse transformers approximate any arbitrary sequence-to-sequence function, similar to their dense counterparts? How do the sparsity pattern and the sparsity level affect their performance? In this paper, we address these questions and provide a unifying framework that captures existing sparse attention models. Our analysis provides sufficient conditions under which we prove that a sparse attention model can universally approximate any sequence-to-sequence function. Surprisingly, our results show the existence of models with only $O(n)$ connections per attention layer that can approximate the same function class as the dense model with $n^2$ connections. Lastly, we present experiments comparing different patterns and levels of sparsity on standard NLP tasks.
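
To make the "$O(n)$ connections per attention layer" concrete, here is a small sketch of one common sparsity pattern, a sliding-window mask in which each position attends only to a fixed-size neighbourhood, so the number of attended pairs grows linearly in $n$ rather than quadratically. The window size and helper name are illustrative assumptions, not the paper's construction.

```python
import torch

def sliding_window_mask(n, window=2):
    """Boolean mask where position i may attend to j only if |i - j| <= window,
    giving O(n * window) = O(n) connections instead of the dense n^2."""
    idx = torch.arange(n)
    return (idx[:, None] - idx[None, :]).abs() <= window

mask = sliding_window_mask(8, window=2)
print(mask.int())
print("sparse connections:", mask.sum().item(), "vs dense:", 8 * 8)
```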


Understanding Transformers, the Data Science Way

#artificialintelligence

Transformers have become the de facto standard for NLP tasks nowadays. While the Transformer architecture was introduced for NLP, it is now being used in Computer Vision and to generate music as well. I am sure you have all heard about the GPT-3 Transformer and its applications. But, all that aside, they are still as hard to understand as ever. It has taken me multiple readings of the Google research paper that first introduced transformers, along with many blog posts, to really understand how a transformer works. So, I thought of putting the whole idea down in as simple words as possible, with some very basic math and some puns, as I am a proponent of having some fun while learning. I will try to keep both the jargon and the technicality to a minimum, yet it is such a topic that I could only do so much. My goal is for the reader to understand even the goriest details of the Transformer by the end of this post. Also, this is officially my longest post, both in the time it took to write and in its length. So, here goes -- this post will be a highly conversational one, and it is about "Decoding The Transformer".