DMFormer: Closing the Gap Between CNN and Vision Transformers

Wei, Zimian, Pan, Hengyue, Li, Lujun, Lu, Menglong, Niu, Xin, Dong, Peijie, Li, Dongsheng

Nov-28-2022–arXiv.org Artificial Intelligence

Vision transformers have shown excellent performance in computer vision tasks. As the computation cost of their self-attention mechanism is expensive, recent works tried to replace the self-attention mechanism in vision transformers with convolutional operations, which is more efficient with built-in inductive bias. However, these efforts either ignore multi-level features or lack dynamic prosperity, leading to sub-optimal performance. In this paper, we propose a Dynamic Multi-level Attention mechanism (DMA), which captures different patterns of input images by multiple kernel sizes and enables input-adaptive weights with a gating mechanism. Based on DMA, we present an efficient backbone network named DMFormer. DMFormer adopts the overall architecture of vision transformers, while replacing the self-attention mechanism with our proposed DMA. Extensive experimental results on ImageNet-1K and ADE20K datasets demonstrated that DMFormer achieves state-of-the-art performance, which outperforms similar-sized vision transformers(ViTs) and convolutional neural networks (CNNs).

artificial intelligence, dmformer, machine learning, (16 more...)

arXiv.org Artificial Intelligence

Nov-28-2022

arXiv.org PDF

Add feedback

Genre:
- Research Report (0.82)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning > Neural Networks
    - Deep Learning (0.48)
  - Vision (1.00)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found