MSVIT: Improving Spiking Vision Transformer Using Multi-scale Attention Fusion

Hua, Wei, Zhou, Chenlin, Wu, Jibin, Chua, Yansong, Shu, Yangyang

Jun-19-2025–arXiv.org Artificial Intelligence

The combination of Spiking Neural Networks (SNNs) with Vision Transformer architectures has garnered significant attention due to their potential for energy-efficient and high-performance computing paradigms. However, a substantial performance gap still exists between SNN-based and ANN-based transformer architectures. While existing methods propose spiking self-attention mechanisms that are successfully combined with SNNs, the overall architectures proposed by these methods suffer from a bottleneck in effectively extracting features from different image scales. In this paper, we address this issue and propose MSVIT. This novel spike-driven Transformer architecture firstly uses multi-scale spiking attention (MSSA) to enhance the capabilities of spiking attention blocks. We validate our approach across various main datasets. The experimental results show that MSVIT outperforms existing SNN-based models, positioning itself as a state-of-the-art solution among SNN-transformer architectures. The codes are available at https://github.com/Nanhu-AI-Lab/MSViT.

machine learning, msvit, natural language, (20 more...)

arXiv.org Artificial Intelligence

Jun-19-2025

arXiv.org PDF

Add feedback

Genre:
- Research Report
  - New Finding (0.48)
  - Promising Solution (0.34)

Technology:
- Information Technology > Artificial Intelligence
  - Vision (1.00)
  - Natural Language (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)