Ever since the seminal paper "Attention Is All You Need," transformers have rekindled interest in language models. While the transformer architecture has become the go-to solution for many natural language processing tasks, its applications to computer vision have remained limited. In vision, attention is either applied alongside CNNs or used to replace certain components of these convolutional networks while keeping their overall structure in place, and convolutional architectures still remain dominant. The paper "An Image Is Worth 16x16 Words" challenged this picture and was widely discussed, including by Tesla's AI head, Andrej Karpathy, among many others.
Deep convolutional neural networks have been the state of the art for most vision tasks; image classification, image segmentation, and object detection are among their major applications. Recently, image-to-image translation has become a focus of attention, since it has a wide range of applications such as photo enhancement, object transfiguration, and semantic segmentation. Image-to-image translation can be applied wherever one image can be mapped into another. The success of convolutional networks on the majority of vision tasks made them the go-to choice for image-to-image translation as well.
The field of computer vision has for years been dominated by convolutional neural networks (CNNs). Through the use of filters, these networks generate feature maps that highlight the most relevant parts of the input image; a multi-layer perceptron then uses these features to perform the desired classification. Recently, however, the field has been revolutionized by the Vision Transformer (ViT) architecture, which, through the mechanism of self-attention, has achieved excellent results on many tasks.
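To make the contrast concrete, here is a minimal sketch of how a ViT turns an image into a token sequence for self-attention. This is illustrative code, not the original implementation; the class and parameter names are my own, and the default sizes follow the common ViT-Base configuration (224x224 images, 16x16 patches, 768-dimensional embeddings).

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into fixed-size patches and project each to a token.

    A strided convolution with kernel_size == stride == patch_size is
    mathematically equivalent to flattening each patch and applying a
    shared linear projection.
    """
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2  # 14 * 14 = 196
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        x = self.proj(x)                      # (B, D, H/P, W/P)
        return x.flatten(2).transpose(1, 2)   # (B, N, D) token sequence

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```

The resulting 196 tokens are what the transformer's self-attention layers operate on, which is why every patch can attend to every other patch from the very first layer.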
Image segmentation is often ambiguous at the level of individual image patches and requires contextual information to reach label consensus. In this paper we introduce Segmenter, a transformer model for semantic segmentation. In contrast to convolution-based approaches, our approach models global context from the first layer onward and throughout the network. We build on the recent Vision Transformer (ViT) and extend it to semantic segmentation. To do so, we rely on the output embeddings corresponding to image patches and obtain class labels from these embeddings with a point-wise linear decoder or a mask transformer decoder. We leverage models pre-trained for image classification and show that they can be fine-tuned on the moderate-sized datasets available for semantic segmentation. The linear decoder already yields excellent results, and performance can be further improved by a mask transformer generating class masks. We conduct an extensive ablation study of the different parameters; in particular, performance is better for large models and small patch sizes. Segmenter attains excellent results for semantic segmentation: it outperforms the state of the art on the challenging ADE20K dataset and performs on par on Pascal Context and Cityscapes.
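The point-wise linear decoder described above can be sketched as follows. This is a simplified, hypothetical rendering rather than the official Segmenter code: each output patch embedding from the ViT encoder is mapped to per-class logits, the token sequence is reshaped into a low-resolution logit map, and that map is bilinearly upsampled to the input resolution (150 classes here, matching ADE20K).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearDecoder(nn.Module):
    """Point-wise linear decoder: one shared linear classifier per patch token."""
    def __init__(self, embed_dim=768, num_classes=150, patch_size=16):
        super().__init__()
        self.patch_size = patch_size
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, patch_tokens, img_hw):
        H, W = img_hw
        h, w = H // self.patch_size, W // self.patch_size
        B, N, _ = patch_tokens.shape          # N == h * w patch tokens
        logits = self.head(patch_tokens)      # (B, N, num_classes)
        logits = logits.transpose(1, 2).reshape(B, -1, h, w)  # (B, C, h, w)
        # Upsample the coarse per-patch predictions to pixel resolution.
        return F.interpolate(logits, size=(H, W),
                             mode="bilinear", align_corners=False)

masks = LinearDecoder()(torch.randn(2, 196, 768), (224, 224))
print(masks.shape)  # torch.Size([2, 150, 224, 224])
```

Because the classifier is applied independently to each token, all contextual reasoning happens in the transformer encoder; the mask transformer decoder mentioned in the abstract replaces this point-wise head with learnable class embeddings, which is where the reported extra performance comes from.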