Collaborating Authors

Deep Transfer Learning for NLP with Transformers


This is arguably the most important architecture for natural language processing (NLP) today. Specifically, we look at modeling frameworks such as the generative pretrained transformer (GPT), bidirectional encoder representations from transformers (BERT) and multilingual BERT (mBERT). These methods employ neural networks with more parameters than most deep convolutional and recurrent neural network models. Despite the larger size, they've exploded in popularity because they scale comparatively more effectively on parallel computing architecture. This enables even larger and more sophisticated models to be developed in practice. Until the arrival of the transformer, the dominant NLP models relied on recurrent and convolutional components. Additionally, the best sequence modeling and transduction problems, such as machine translation, rely on an encoder-decoder architecture with an attention mechanism to detect which parts of the input influence each part of the output. The transformer aims to replace the recurrent and convolutional components entirely with attention.

How Transformers work in deep learning and NLP: an intuitive introduction


The famous paper "Attention is all you need" in 2017 changed the way we were thinking about attention. Nonetheless, 2020 was definitely the year of transformers! From natural language now they are into computer vision tasks. How did we go from attention to self-attention? Why does the transformer work so damn well? What are the critical components for its success? Read on and find out! In my opinion, transformers are not so hard to grasp.

Transformers from scratch


I will assume a basic understanding of neural networks and backpropagation. If you'd like to brush up, this lecture will give you the basics of neural networks and this one will explain how these principles are applied in modern deep learning systems. A working knowledge of Pytorch is required to understand the programming examples, but these can also be safely skipped. The fundamental operation of any transformer architecture is the self-attention operation. Self-attention is a sequence-to-sequence operation: a sequence of vectors goes in, and a sequence of vectors comes out. The vectors all have dimension \(k\). A few other ingredients are needed for a complete transformer, which we'll discuss later, but this is the fundamental operation. More importantly, this is the only operation in the whole architecture that propagates information between vectors. Every other operation in the transformer is applied to each vector in the input sequence without interactions between vectors. Despite its simplicity, it's not immediately obvious why self-attention should work so well. To build up some intuition, let's look first at the standard approach to movie recommendation.

The Illustrated Transformer


Discussions: Hacker News (65 points, 4 comments), Reddit r/MachineLearning (29 points, 3 comments) Translations: Chinese (Simplified), Japanese, Korean, Russian, Spanish Watch: MIT’s Deep Learning State of the Art lecture referencing this post In the previous post, we looked at Attention – a ubiquitous method in modern deep learning models. Attention is a concept that helped improve the performance of neural machine translation applications. In this post, we will look at The Transformer – a model that uses attention to boost the speed with which these models can be trained. The Transformers outperforms the Google Neural Machine Translation model in specific tasks. The biggest benefit, however, comes from how The Transformer lends itself to parallelization. It is in fact Google Cloud’s recommendation to use The Transformer as a reference model to use their Cloud TPU offering. So let’s try to break the model apart and look at how it functions. The Transformer was proposed in the paper Attention is All You Need. A TensorFlow implementation of it is available as a part of the Tensor2Tensor package. Harvard’s NLP group created a guide annotating the paper with PyTorch implementation. In this post, we will attempt to oversimplify things a bit and introduce the concepts one by one to hopefully make it easier to understand to people without in-depth knowledge of the subject matter. A High-Level Look Let’s begin by looking at the model as a single black box. In a machine translation application, it would take a sentence in one language, and output its translation in another.

Image Captioning with an End to End Transformer Network.


Transformer Networks are deep learning models that learn context and meaning in sequential data by tracking the relationships between the sequences. Since the introduction of Transformer Networks in 2017 by Google Brain in their revolutionary paper "Attention is all you need", transformers have been outperforming conventional neural networks in various problem domains, like Neural Machine Translation, Text Summarization, Language Understanding, and other Natural Language Processing tasks. Along with this, they have also proved to be quite effective in Computer Vision tasks like Image Classification with Vision Transformers and Generative Networks as well. In this article, I will be trying to elaborate on my understanding of the attention mechanism through vision transformers and on sequence to sequence tasks through Transformer Networks. For problems in the Image Domain, like Image Classification and feature extraction from Images, Deep Convolutional Neural Network architectures like ResNet and Inception are used.