Iwin Transformer: Hierarchical Vision Transformer using Interleaved Windows
arXiv.org Artificial Intelligence
Abstract--We introduce Iwin Transformer, a novel position-embedding-free hierarchical vision transformer that can be fine-tuned directly from low to high resolution through the collaboration of innovative interleaved window attention and depthwise separable convolution. The approach uses attention to connect distant tokens and convolution to link neighboring tokens, enabling global information exchange within a single module and overcoming the Swin Transformer's limitation of requiring two consecutive blocks to approximate global attention. Extensive experiments on visual benchmarks demonstrate that Iwin Transformer is highly competitive in tasks such as image classification (87.4% top-1 accuracy on ImageNet-1K), semantic segmentation, and video action recognition. We also validate the effectiveness of Iwin's core component as a standalone module that can seamlessly replace the self-attention module in class-conditional image generation. The concepts and methods introduced by the Iwin Transformer have the potential to inspire future research, such as Iwin 3D Attention in video generation.

Vision Transformers (ViTs) [1] have fundamentally transformed computer vision by borrowing the transformer architecture from natural language models [2]. Unlike Convolutional Neural Networks (CNNs) [3], which rely on local receptive fields to capture image features, ViTs leverage self-attention mechanisms to capture global dependencies, demonstrating remarkable performance on vision tasks. To tackle the quadratic complexity of ViTs and enhance their efficiency while maintaining performance, various approaches have been proposed. Hierarchical Designs such as PVT [4] and Twins [5] utilize multi-scale feature pyramids to progressively reduce spatial dimensions. Hybrid CNN-Transformer Architectures like ConViT [6] and CoAtNet [7] combine convolutional operations with self-attention to leverage the strengths of both paradigms.
Efficient Token Fusion strategies such as TokenLearner [8] dynamically aggregate tokens to reduce sequence length, while Sparse Attention Patterns exemplified by Reformer [9] utilize locality-sensitive hashing to attend only to relevant tokens. Additionally, efficient implementations like Performer [10] approximate attention through kernel methods to achieve linear complexity.
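The interleaved-window idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: it simply contrasts, on a 1-D token sequence, a Swin-style contiguous window partition with an interleaved partition in which each window gathers tokens at a fixed stride, so that self-attention inside one interleaved window already connects distant tokens (neighboring tokens are then linked by the depthwise convolution).

```python
import numpy as np

def contiguous_windows(tokens, window_size):
    """Swin-style partition: consecutive tokens share a window."""
    num_windows = len(tokens) // window_size
    return tokens.reshape(num_windows, window_size)

def interleaved_windows(tokens, window_size):
    """Interleaved partition (sketch): tokens spaced `num_windows`
    apart share a window, so one window spans the whole sequence."""
    num_windows = len(tokens) // window_size
    return tokens.reshape(window_size, num_windows).T

tokens = np.arange(8)  # token indices 0..7
print(contiguous_windows(tokens, 4))   # [[0 1 2 3], [4 5 6 7]]
print(interleaved_windows(tokens, 4))  # [[0 2 4 6], [1 3 5 7]]
```

Attention within each interleaved window therefore mixes tokens that are far apart in the original layout, which is why a single Iwin module can exchange global information without the shifted-window pairing that Swin requires.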
Dec-9-2025