


Fast Vision Transformers with HiLo Attention

Neural Information Processing Systems

Vision Transformers (ViTs) have driven the most recent and significant breakthroughs in computer vision. Their efficient designs are mostly guided by the indirect metric of computational complexity, i.e., FLOPs, which, however, has a clear gap with direct metrics such as throughput. Thus, we propose to use direct speed evaluation on the target platform as the design principle for efficient ViTs. In particular, we introduce LITv2, a simple and effective ViT that performs favourably against existing state-of-the-art methods across a spectrum of model sizes while running faster. At the core of LITv2 is a novel self-attention mechanism, which we dub HiLo. HiLo is inspired by the insight that high frequencies in an image capture local fine details while low frequencies capture global structures, whereas a standard multi-head self-attention layer neglects these distinct frequency characteristics.
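The abstract describes HiLo's core idea: splitting the attention heads into a high-frequency path (local attention within small windows) and a low-frequency path (every query attends to average-pooled keys/values). The following NumPy sketch illustrates that two-path structure under our own simplifying assumptions; the function name, the random projection weights, and the single-head-per-path treatment are ours for illustration, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def hilo_attention(x, num_heads=4, alpha=0.5, window=2, seed=0):
    """Illustrative HiLo sketch (not the official code).

    x: feature map of shape (H, W, C). alpha is the fraction of heads
    assigned to the low-frequency (Lo-Fi) path; the rest form the
    high-frequency (Hi-Fi) path.
    """
    H, W, C = x.shape
    rng = np.random.default_rng(seed)  # random weights, for illustration only
    lo_heads = int(alpha * num_heads)
    hi_heads = num_heads - lo_heads
    head_dim = C // num_heads

    def attend(q, k, v):
        scores = q @ k.swapaxes(-1, -2) / np.sqrt(q.shape[-1])
        return softmax(scores) @ v

    outs = []
    if hi_heads:
        # Hi-Fi path: self-attention inside non-overlapping windows,
        # capturing local fine details.
        d = hi_heads * head_dim
        Wq, Wk, Wv = (rng.standard_normal((C, d)) * 0.02 for _ in range(3))
        xw = x.reshape(H // window, window, W // window, window, C)
        xw = xw.transpose(0, 2, 1, 3, 4).reshape(-1, window * window, C)
        out = attend(xw @ Wq, xw @ Wk, xw @ Wv)  # batched per-window attention
        out = out.reshape(H // window, W // window, window, window, d)
        out = out.transpose(0, 2, 1, 3, 4).reshape(H, W, d)
        outs.append(out)
    if lo_heads:
        # Lo-Fi path: all queries attend to average-pooled keys/values,
        # capturing global structure at reduced cost.
        d = lo_heads * head_dim
        Wq, Wk, Wv = (rng.standard_normal((C, d)) * 0.02 for _ in range(3))
        pooled = x.reshape(H // window, window, W // window, window, C)
        pooled = pooled.mean(axis=(1, 3)).reshape(-1, C)
        q = x.reshape(-1, C) @ Wq
        out = attend(q, pooled @ Wk, pooled @ Wv).reshape(H, W, d)
        outs.append(out)
    # Concatenate the two paths along the channel dimension.
    return np.concatenate(outs, axis=-1)
```

Because the Lo-Fi keys/values are pooled down to one token per window, its attention matrix shrinks by a factor of `window**2`, which is where HiLo's speedup over full self-attention comes from.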


Supplementary Material for Fast Vision Transformers with HiLo Attention

Neural Information Processing Systems

Department of Data Science & AI, Monash University, Australia. We organize our supplementary material as follows. In Section A, we describe the architecture specifications of LITv2. In Section B, we provide the derivation of the computational cost of HiLo attention. In Section C, we study the effect of window size on CIFAR-100. In Section F, we provide more visualisation examples for the spectrum analysis of HiLo attention. "ConvFFN" denotes our modified FFN layer, in which we adopt one convolutional layer; we use "ConvFFN Block" to differentiate it from the standard FFN block. The overall framework of LITv2 is depicted in Figure 1.
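The supplementary derives the computational cost of HiLo attention. A back-of-the-envelope version consistent with the two-path description is sketched below; the symbols (N = HW tokens, channel dimension D, window size s, Lo-Fi head ratio α) are our notation, the linear projections are ignored, and the exact constants should be taken from the paper's own derivation.

```latex
% Hi-Fi: self-attention inside N/s^2 non-overlapping s x s windows.
% Lo-Fi: all N queries attend to N/s^2 average-pooled keys/values.
\begin{aligned}
\Omega(\text{Hi-Fi}) &= \frac{N}{s^2}\cdot 2\,(s^2)^2\,(1-\alpha)D
                      = 2\,s^2 N\,(1-\alpha)D,\\
\Omega(\text{Lo-Fi}) &= 2\,N\cdot\frac{N}{s^2}\cdot \alpha D,\\
\Omega(\text{HiLo})  &= 2ND\!\left[(1-\alpha)\,s^2 + \alpha\,\frac{N}{s^2}\right].
\end{aligned}
```

The Hi-Fi term is linear in N for a fixed window size, and the Lo-Fi term shrinks the quadratic cost of full attention by a factor of s², which is why larger windows trade accuracy for speed in the Section C study.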

