EDIT: Enhancing Vision Transformers by Mitigating Attention Sink through an Encoder-Decoder Architecture

Feng, Wenfeng, Wang, Hongxiang, Wang, Jianlong, Zhang, Xin, Zhao, Jingjing, Liang, Yueyue, Chen, Xiang, Han, Duokui

Oct-17-2025–arXiv.org Artificial Intelligence

Abstract: In this paper, we propose EDIT (Encoder-Decoder Image Transformer), a novel architecture designed to mitigate the attention sink phenomenon observed in Vision Transformer (ViT) models. Attention sink occurs when an excessive amount of attention is allocated to the [CLS] token, distorting the model's ability to effectively process image patches. To address this, we introduce a layer-aligned encoder-decoder architecture, where the encoder utilizes self-attention to process image patches, while the decoder uses cross-attention to focus on the [CLS] token. Unlike the traditional encoder-decoder framework, where the decoder depends solely on high-level encoder representations, EDIT allows the decoder to extract information starting from low-level features, progressively refining the representation layer by layer. EDIT is naturally interpretable, as demonstrated through sequential attention ...

Introduction: The Transformer, introduced by Vaswani et al. [1], utilizes self-attention and cross-attention mechanisms to extract intrinsic features from text data. It includes both an encoder and a decoder, with the encoder extracting relevant information from the input data and the decoder generating outputs based on this representation. The Transformer and its improvements have achieved significant success in natural language processing (NLP) tasks [1, 2, 3, 4, 5].
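The layer-aligned design described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes single-head attention with no learned projections, no layer normalization, and no MLP blocks, and the names `attend`, `edit_forward`, and `num_layers` are hypothetical. It only shows the data flow, in which patch tokens attend to each other in the encoder while the [CLS] token stays out of that self-attention and instead queries each encoder layer's output through cross-attention.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention on lists of equal-length vectors.

    Returns (outputs, weights); weights[i] is the attention row of query i.
    """
    scale = math.sqrt(len(keys[0]))
    dim = len(values[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        row = softmax(scores)
        weights.append(row)
        outputs.append([sum(w * v[d] for w, v in zip(row, values)) for d in range(dim)])
    return outputs, weights


def residual(xs, ys):
    """Element-wise sum of two lists of vectors (residual connection)."""
    return [[a + b for a, b in zip(x, y)] for x, y in zip(xs, ys)]


def edit_forward(patches, cls, num_layers):
    """Hypothetical layer-aligned encoder-decoder pass (assumption-laden sketch).

    Encoder: self-attention over patch tokens only, so [CLS] cannot act as a sink.
    Decoder: [CLS] is the only query and cross-attends to the output of the
    encoder layer at the same depth, starting from low-level features.
    """
    attention_maps = []
    for _ in range(num_layers):
        enc_out, _ = attend(patches, patches, patches)
        patches = residual(patches, enc_out)
        dec_out, weights = attend([cls], patches, patches)
        cls = residual([cls], dec_out)[0]
        attention_maps.append(weights[0])  # one map over patches per layer
    return cls, attention_maps


random.seed(0)
patches = [[random.gauss(0.0, 1.0) for _ in range(8)] for _ in range(4)]
cls = [0.0] * 8
final_cls, attention_maps = edit_forward(patches, cls, num_layers=3)
```

Under these assumptions, `attention_maps` holds one distribution over the patches per layer, which is the kind of per-layer sequence an interpretability analysis could inspect.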

artificial intelligence, machine learning, natural language

Genre:
- Research Report (1.00)

Technology:
- Information Technology
  - Sensing and Signal Processing > Image Processing (1.00)
  - Artificial Intelligence
    - Vision (1.00)
    - Natural Language > Machine Translation (0.68)
    - Machine Learning > Neural Networks
      - Deep Learning (0.47)
