ECViT: Efficient Convolutional Vision Transformer with Local-Attention and Multi-scale Stages

Apr-22-2025–arXiv.org Artificial Intelligence

Vision Transformers (ViTs) have revolutionized computer vision by leveraging self-attention to model long-range dependencies. To address the limitations of ViTs, we propose the Efficient Convolutional Vision Transformer (ECViT), a hybrid architecture that combines the strengths of Convolutional Neural Networks (CNNs) and Transformers. ECViT introduces inductive biases inherent to CNNs, such as locality and translation invariance, into the Transformer framework by extracting patches from low-level features and enhancing the encoder with convolutional operations. Additionally, it incorporates local attention and a pyramid structure to enable efficient multi-scale feature extraction and representation. Experimental results demonstrate that ECViT achieves an optimal balance between performance and efficiency, outperforming state-of-the-art models on various image classification tasks while maintaining low computational and storage requirements. ECViT offers an ideal solution for applications that prioritize high efficiency without compromising performance.

Transformers use self-attention [1] to model long-range dependencies, revolutionizing how models handle sequential data. The Vision Transformer (ViT) [2] treats images as sequences of patches and uses self-attention to capture global dependencies, marking a successful transition of the Transformer from natural language processing (NLP) to computer vision (CV).
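To make the efficiency argument concrete, the following is a minimal sketch that counts how many query-key pairs are scored under global attention versus attention restricted to local windows, and how a pyramid shrinks the token count from stage to stage. Everything numeric here is an assumption for illustration and is not taken from the paper: the 224x224 input, the patch size of 4, the 7x7 non-overlapping windows, the four stages, and the halving of resolution between stages. The function names are hypothetical, and the excerpt does not specify how ECViT actually partitions tokens for local attention.

```python
def num_tokens(height, width, patch):
    """Number of non-overlapping patch tokens for an image of the given size."""
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    return (height // patch) * (width // patch)


def global_attention_pairs(n_tokens):
    """Query-key pairs scored when every token attends to every token."""
    return n_tokens * n_tokens


def local_attention_pairs(grid_h, grid_w, window):
    """Query-key pairs scored when tokens attend only inside
    non-overlapping window x window blocks of the token grid
    (an assumed form of local attention)."""
    if grid_h % window or grid_w % window:
        raise ValueError("token grid must be divisible by the window size")
    n_windows = (grid_h // window) * (grid_w // window)
    tokens_per_window = window * window
    return n_windows * tokens_per_window ** 2


def pyramid_token_counts(height, width, patch, stages):
    """Token count at each stage, assuming the spatial resolution
    is halved along each axis between consecutive stages."""
    grid_h, grid_w = height // patch, width // patch
    counts = []
    for _ in range(stages):
        counts.append(grid_h * grid_w)
        grid_h //= 2
        grid_w //= 2
    return counts


if __name__ == "__main__":
    # Hypothetical settings, chosen only for illustration.
    n = num_tokens(224, 224, 4)
    print("tokens:", n)
    print("global pairs:", global_attention_pairs(n))
    print("local pairs:", local_attention_pairs(56, 56, 7))
    print("tokens per stage:", pyramid_token_counts(224, 224, 4, 4))
```

Under these assumed settings the first stage has 3136 tokens, so global attention scores 3136 x 3136 pairs while windowed attention scores 64 times fewer, because each token only attends to the 49 tokens in its own window. This is the general reason local attention and progressive downsampling reduce cost; the actual savings in ECViT depend on design details not given in this excerpt.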

artificial intelligence, machine learning, transformer

Genre:
- Research Report
  - New Finding (0.66)
  - Promising Solution (0.48)

Technology:
- Information Technology > Artificial Intelligence
  - Vision (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (0.68)
