EDIT: Enhancing Vision Transformers by Mitigating Attention Sink through an Encoder-Decoder Architecture

Open in new window