AITopics | hierarchical vit

In essence, MAE proposed an asymmetric encoder-decoder architecture for MIM, where the encoder (e.g., a standard ViT model [17])operates onlyonvisible patches, andthelight-weight decoder recoversallpatches for maskprediction.

artificial intelligence, deep learning, machine learning, (17 more...)

Neural Information Processing Systems

Country: Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.04)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback

Green Hierarchical Vision Transformer for Masked Image Modeling

Neural Information Processing SystemsDec-24-2025, 14:13:32 GMT

We present an efficient approach for Masked Image Modeling (MIM) with hierarchical Vision Transformers (ViTs), allowing the hierarchical ViTs to discard masked patches and operate only on the visible ones. Our approach consists of three key designs. First, for window attention, we propose a Group Window Attention scheme following the Divide-and-Conquer strategy. To mitigate the quadratic complexity of the self-attention w.r.t. the number of patches, group attention encourages a uniform partition that visible patches within each local window of arbitrary size can be grouped with equal size, where masked self-attention is then performed within each group. Second, we further improve the grouping strategy via the Dynamic Programming algorithm to minimize the overall computation cost of the attention on the grouped patches. Third, as for the convolution layers, we convert them to the Sparse Convolution that works seamlessly with the sparse data, i.e., the visible patches in MIM. As a result, MIM can now work on most, if not all, hierarchical ViTs in a green and efficient way. For example, we can train the hierarchical ViTs, e.g., Swin Transformer and Twins Transformer, about 2.7$\times$ faster and reduce the GPU memory usage by 70%, while still enjoying competitive performance on ImageNet classification and the superiority on downstream COCO object detection benchmarks.

green hierarchical vision transformer, masked image modeling, name change, (5 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Vision (1.00)

Add feedback

MG-ViT: A Multi-Granularity Method for Compact and Efficient Vision Transformers Y u Zhang

Neural Information Processing SystemsOct-9-2025, 09:08:48 GMT

Vision Transformer (ViT) faces obstacles in wide application due to its huge computational cost.

artificial intelligence, machine learning, transformer, (19 more...)

Neural Information Processing Systems

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
Asia > China (0.04)
Asia > Middle East > Israel (0.04)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)

Add feedback

7e487c72fce6e45879a78ee0872d991d-Supplemental-Conference.pdf

Neural Information Processing SystemsAug-16-2025, 09:40:32 GMT

artificial intelligence, machine learning, visible patch, (13 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.31)

Add feedback

7e487c72fce6e45879a78ee0872d991d-Paper-Conference.pdf

Neural Information Processing SystemsAug-16-2025, 09:40:28 GMT

arxiv preprint arxiv, machine learning, natural language, (13 more...)

Neural Information Processing Systems

Country: Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.04)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

Green Hierarchical Vision Transformer for Masked Image Modeling

Neural Information Processing SystemsJan-15-2025, 11:42:41 GMT

We present an efficient approach for Masked Image Modeling (MIM) with hierarchical Vision Transformers (ViTs), allowing the hierarchical ViTs to discard masked patches and operate only on the visible ones. Our approach consists of three key designs. First, for window attention, we propose a Group Window Attention scheme following the Divide-and-Conquer strategy. To mitigate the quadratic complexity of the self-attention w.r.t. the number of patches, group attention encourages a uniform partition that visible patches within each local window of arbitrary size can be grouped with equal size, where masked self-attention is then performed within each group. Second, we further improve the grouping strategy via the Dynamic Programming algorithm to minimize the overall computation cost of the attention on the grouped patches.

green hierarchical vision transformer, hierarchical vit, masked image modeling, (2 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Vision (1.00)

Add feedback

Plug n' Play: Channel Shuffle Module for Enhancing Tiny Vision Transformers

Xu, Xuwei, Wang, Sen, Chen, Yudong, Liu, Jiajun

arXiv.org Artificial IntelligenceOct-9-2023

Vision Transformers (ViTs) have demonstrated remarkable performance in various computer vision tasks. However, the high computational complexity hinders ViTs' applicability on devices with limited memory and computing resources. Although certain investigations have delved into the fusion of convolutional layers with self-attention mechanisms to enhance the efficiency of ViTs, there remains a knowledge gap in constructing tiny yet effective ViTs solely based on the self-attention mechanism. Furthermore, the straightforward strategy of reducing the feature channels in a large but outperforming ViT often results in significant performance degradation despite improved efficiency. To address these challenges, we propose a novel channel shuffle module to improve tiny-size ViTs, showing the potential of pure self-attention models in environments with constrained computing resources. Inspired by the channel shuffle design in ShuffleNetV2 \cite{ma2018shufflenet}, our module expands the feature channels of a tiny ViT and partitions the channels into two groups: the \textit{Attended} and \textit{Idle} groups. Self-attention computations are exclusively employed on the designated \textit{Attended} group, followed by a channel shuffle operation that facilitates information exchange between the two groups. By incorporating our module into a tiny ViT, we can achieve superior performance while maintaining a comparable computational complexity to the vanilla model. Specifically, our proposed channel shuffle module consistently improves the top-1 accuracy on the ImageNet-1K dataset for various tiny ViT models by up to 2.8\%, with the changes in model complexity being less than 0.03 GMACs.

artificial intelligence, machine learning, module, (13 more...)

arXiv.org Artificial Intelligence

2310.05642

Country: Oceania > Australia > Queensland > Brisbane (0.05)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

Green Hierarchical Vision Transformer for Masked Image Modeling

Huang, Lang, You, Shan, Zheng, Mingkai, Wang, Fei, Qian, Chen, Yamasaki, Toshihiko

arXiv.org Artificial IntelligenceOct-14-2022

We present an efficient approach for Masked Image Modeling (MIM) with hierarchical Vision Transformers (ViTs), allowing the hierarchical ViTs to discard masked patches and operate only on the visible ones. Our approach consists of three key designs. First, for window attention, we propose a Group Window Attention scheme following the Divide-and-Conquer strategy. To mitigate the quadratic complexity of the self-attention w.r.t. the number of patches, group attention encourages a uniform partition that visible patches within each local window of arbitrary size can be grouped with equal size, where masked self-attention is then performed within each group. Second, we further improve the grouping strategy via the Dynamic Programming algorithm to minimize the overall computation cost of the attention on the grouped patches. Third, as for the convolution layers, we convert them to the Sparse Convolution that works seamlessly with the sparse data, i.e., the visible patches in MIM. As a result, MIM can now work on most, if not all, hierarchical ViTs in a green and efficient way. For example, we can train the hierarchical ViTs, e.g., Swin Transformer and Twins Transformer, about 2.7$\times$ faster and reduce the GPU memory usage by 70%, while still enjoying competitive performance on ImageNet classification and the superiority on downstream COCO object detection benchmarks. Code and pre-trained models have been made publicly available at https://github.com/LayneH/GreenMIM.

artificial intelligence, arxiv preprint arxiv, machine learning, (13 more...)

arXiv.org Artificial Intelligence

2205.13515

Country: Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.04)

Genre: Research Report > New Finding (0.67)

Technology: