EDIT: Enhancing Vision Transformers by Mitigating Attention Sink through an Encoder-Decoder Architecture

Feng, Wenfeng, Wang, Hongxiang, Wang, Jianlong, Zhang, Xin, Zhao, Jingjing, Liang, Yueyue, Chen, Xiang, Han, Duokui

arXiv.org Artificial Intelligence 

Abstract: In this paper, we propose EDIT (Encoder-Decoder Image Transformer), a novel architecture designed to mitigate the attention sink phenomenon observed in Vision Transformer (ViT) models. Attention sink occurs when an excessive amount of attention is allocated to the [CLS] token, distorting the model's ability to effectively process image patches. To address this, we introduce a layer-aligned encoder-decoder architecture, where the encoder utilizes self-attention to process image patches, while the decoder uses cross-attention to focus on the [CLS] token. Unlike traditional encoder-decoder frameworks, where the decoder depends solely on high-level encoder representations, EDIT allows the decoder to extract information starting from low-level features, progressively refining the representation layer by layer. EDIT is naturally interpretable, as demonstrated through sequential attention.

Introduction

The Transformer, introduced by Vaswani et al. [1], utilizes self-attention and cross-attention mechanisms to extract intrinsic features from text data. The Transformer includes both an encoder and a decoder, with the encoder extracting relevant information from input data and the decoder generating outputs based on this representation. The Transformer and its improved variants have achieved significant success in natural language processing (NLP) tasks [1, 2, 3, 4, 5].
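
To make the layer-aligned design described in the abstract concrete, the following is a minimal NumPy sketch rather than the authors' implementation. It assumes single-head attention, omits layer normalization and MLP blocks, and uses random weights; the names `attention`, `edit_forward`, `enc_params`, and `dec_params` are hypothetical. It also assumes that the decoder at layer l cross-attends to the encoder output of the same layer, which is one reading of "layer-aligned". The point it illustrates is that the [CLS] token never enters the encoder sequence, so patch self-attention cannot be drawn toward it.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Single-head scaled dot-product attention; returns (output, weights)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

def edit_forward(patches, cls, enc_params, dec_params):
    """Layer-aligned encoder-decoder pass (simplified sketch, see assumptions).

    patches: (n_patches, dim) patch embeddings
    cls:     (1, dim) [CLS] embedding, kept out of the encoder sequence
    """
    attn_maps = []
    x = patches
    for (Wq, Wk, Wv), (Uq, Uk, Uv) in zip(enc_params, dec_params):
        # Encoder layer: self-attention among image patches only.
        enc_out, _ = attention(x @ Wq, x @ Wk, x @ Wv)
        x = x + enc_out  # residual connection
        # Decoder layer: [CLS] is the only query and cross-attends to the
        # encoder output of the same depth (assumed alignment).
        dec_out, w = attention(cls @ Uq, x @ Uk, x @ Uv)
        cls = cls + dec_out
        attn_maps.append(w[0])  # one attention map over patches per layer
    return cls, attn_maps

# Toy usage with random inputs and weights (illustrative values only).
rng = np.random.default_rng(0)
n_patches, dim, depth = 16, 8, 3

def make_params():
    # Three projection matrices (query, key, value) per layer.
    return [tuple(rng.normal(scale=dim ** -0.5, size=(dim, dim)) for _ in range(3))
            for _ in range(depth)]

patches = rng.normal(size=(n_patches, dim))
cls = rng.normal(size=(1, dim))
cls_out, attn_maps = edit_forward(patches, cls, make_params(), make_params())
print(cls_out.shape, len(attn_maps))
```

The per-layer entries of `attn_maps` correspond to the sequence of attention distributions over patches that the abstract refers to when it describes layer-by-layer refinement; each entry is a probability vector over the patches.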