Yun, Chulhee, Chang, Yin-Wen, Bhojanapalli, Srinadh, Rawat, Ankit Singh, Reddi, Sashank J., Kumar, Sanjiv

Transformer networks use pairwise attention to compute contextual embeddings of inputs, and have redefined the state of the art in many NLP tasks. However, these models suffer from quadratic computational cost in the input sequence length $n$ to compute attention in each layer. This has prompted recent research into faster attention models, with a predominant approach involving sparsifying the connections in the attention layers. While empirically promising for long sequences, fundamental questions remain unanswered: Can sparse transformers approximate any arbitrary sequence-to-sequence function, similar to their dense counterparts? How do the sparsity pattern and the sparsity level affect their performance? In this paper, we address these questions and provide a unifying framework that captures existing sparse attention models. Our analysis proposes sufficient conditions under which we prove that a sparse attention model can universally approximate any sequence-to-sequence function. Surprisingly, our results show the existence of models with only $O(n)$ connections per attention layer that can approximate the same function class as the dense model with $n^2$ connections. Lastly, we present experiments comparing different patterns/levels of sparsity on standard NLP tasks.
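To make the $O(n)$ versus $n^2$ comparison concrete, the following is a minimal sketch that counts query-key connections for one illustrative sparsity pattern. The pattern (a sliding window plus one global token) and the names `sliding_window_pattern` and `count_connections` are assumptions chosen for illustration; they are not taken from the paper, and the sketch does not check the paper's sufficient conditions.

```python
def sliding_window_pattern(n, window=1, global_tokens=(0,)):
    """For each query position, return the set of key positions it attends to.

    Hypothetical pattern: every token sees its neighbours within `window`
    steps plus a few global tokens; a global token attends to all positions.
    """
    pattern = []
    for i in range(n):
        if i in global_tokens:
            keys = set(range(n))
        else:
            lo, hi = max(0, i - window), min(n - 1, i + window)
            keys = set(range(lo, hi + 1)) | set(global_tokens)
        pattern.append(keys)
    return pattern


def count_connections(pattern):
    """Total number of (query, key) pairs kept by the pattern."""
    return sum(len(keys) for keys in pattern)


if __name__ == "__main__":
    for n in (16, 64, 256):
        sparse = count_connections(sliding_window_pattern(n))
        print(f"n={n}: dense={n * n}, sparse={sparse}")
```

With the default arguments the sparse count grows linearly in $n$ (it works out to $5n - 6$ for $n \ge 3$), while the dense count grows as $n^2$.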

Transformers have become the de facto standard for NLP tasks. Although the Transformer architecture was introduced for NLP, it is now also used in computer vision and to generate music. I am sure you have all heard about the GPT-3 Transformer and its applications. But all these things aside, Transformers are still as hard to understand as ever. It took me multiple readings of the Google research paper that first introduced transformers, along with a great many blog posts, to really understand how a transformer works. So I thought of putting the whole idea down in words as simple as possible, with some very basic math and some puns, as I am a proponent of having some fun while learning. I will try to keep both the jargon and the technicality to a minimum, yet with a topic like this I can only do so much. My goal is for the reader to understand even the goriest details of the Transformer by the end of this post. Also, this is officially my longest post, both in terms of the time taken to write it and its length. So, here goes -- this post will be a highly conversational one, and it is about "Decoding The Transformer".

By replacing the attention sublayer with linear transformations, we are able to reduce the complexity and memory footprint of the Transformer architecture. We show that FNet offers an excellent compromise between speed, memory footprint, and accuracy, achieving 92% of the accuracy of BERT in a common classification transfer learning setup on the GLUE benchmark (Wang et al., 2018), but training seven times as fast on GPUs and twice as fast on TPUs.

Recent ML papers have focused on fiddling with transformer layers. It is quite interesting to see what works and what doesn't (even though we probably only see what works in those papers). Given how widely transformers are used, I think the last 6–12 months have been about optimizing them. This paper review talks about changing the layers to improve training speed, and the most interesting part is that it was done using Fourier transforms. The Fourier transform is a mathematical operation that decomposes a signal into its constituent frequencies.
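To get a feel for how a Fourier transform can stand in for attention, here is a rough sketch of a token-mixing step. It assumes the mixing is a discrete Fourier transform applied along the hidden dimension and then along the sequence dimension, keeping only the real part. The helper names `dft` and `fourier_mix` are made up for this example, and the naive transform is written for readability, not speed.

```python
import cmath


def dft(values):
    """Naive O(N^2) discrete Fourier transform of a 1-D sequence."""
    n = len(values)
    return [
        sum(v * cmath.exp(-2j * cmath.pi * k * t / n) for t, v in enumerate(values))
        for k in range(n)
    ]


def fourier_mix(x):
    """Mix a (seq_len x hidden) matrix with a 2-D DFT and keep the real part.

    There are no learned parameters here: the mixing is a fixed linear map.
    """
    rows = [dft(row) for row in x]                 # transform along the hidden dim
    cols = [dft(list(col)) for col in zip(*rows)]  # transform along the sequence dim
    hidden, seq_len = len(cols), len(x)
    # cols is indexed [hidden][seq]; transpose back to [seq][hidden].
    return [[cols[h][s].real for h in range(hidden)] for s in range(seq_len)]


if __name__ == "__main__":
    # A single non-zero entry: after mixing, every position has "seen" it.
    x = [[1.0, 0.0], [0.0, 0.0]]
    for row in fourier_mix(x):
        print([round(v, 6) for v in row])
```

The point of the toy input is that one non-zero entry in the first token spreads to every output position, so information moves between tokens without any attention weights being computed.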

Transformers have become the de facto standard for NLP tasks. Not only that, but they are now also being used in computer vision and to generate music. I am sure you have all heard about the GPT-3 Transformer and its applications. But all these things aside, they are still as hard to understand as ever. It took me multiple readings of the Google research paper that first introduced transformers, along with a great many blog posts, to really understand how a transformer works. So I thought of putting the whole idea down in words as simple as possible, along with some very basic math and some puns, as I am a proponent of having some fun while learning. I will try to keep both the jargon and the technicality to a minimum, yet with a topic like this I can only do so much. My goal is for the reader to understand even the goriest details of the Transformer by the end of this post. Also, this is officially my longest post, both in terms of the time taken to write it and its length. Hence, I advise you to grab a coffee.

Sukhbaatar, Sainbayar, Grave, Edouard, Lample, Guillaume, Jegou, Herve, Joulin, Armand

Transformer networks have led to important progress in language modeling and machine translation. These models include two consecutive modules, a feed-forward layer and a self-attention layer. The latter allows the network to capture long-term dependencies and is often regarded as the key ingredient in the success of Transformers. Building upon this intuition, we propose a new model that solely consists of attention layers. More precisely, we augment the self-attention layers with persistent memory vectors that play a similar role as the feed-forward layer. Thanks to these vectors, we can remove the feed-forward layer without degrading the performance of a transformer. Our evaluation shows the benefits brought by our model on standard character- and word-level language modeling benchmarks.
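The idea of augmenting self-attention with persistent memory vectors can be sketched as follows, assuming the persistent vectors act as extra key/value pairs that do not depend on the input and are appended to the context's keys and values before the softmax. The function and argument names are hypothetical, and details such as multiple heads, positional information, and how the vectors are trained are left out.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_with_persistent_memory(query, ctx_keys, ctx_values, mem_keys, mem_values):
    """Single-query attention over context vectors plus persistent vectors.

    `mem_keys` / `mem_values` stand in for learned, input-independent vectors
    (an assumption of this sketch); they are simply concatenated with the
    context keys and values.
    """
    keys = ctx_keys + mem_keys
    values = ctx_values + mem_values
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return output, weights


if __name__ == "__main__":
    out, w = attention_with_persistent_memory(
        query=[1.0, 0.0],
        ctx_keys=[[1.0, 0.0], [0.0, 1.0]],
        ctx_values=[[1.0, 0.0], [0.0, 1.0]],
        mem_keys=[[0.5, 0.5]],
        mem_values=[[2.0, 2.0]],
    )
    print("weights:", [round(x, 3) for x in w])
    print("output:", [round(x, 3) for x in out])
```

Because the persistent vectors compete in the same softmax as the context, the query decides how much to read from the input and how much from the fixed memory.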