EfficientFormer: Vision Transformers at MobileNet Speed

Neural Information Processing Systems

Vision Transformers (ViT) have shown rapid progress in computer vision tasks, achieving promising results on various benchmarks. However, due to the massive number of parameters and model design, e.g., the attention mechanism, ViT-based models are generally many times slower than lightweight convolutional networks. Therefore, deploying ViT for real-time applications is particularly challenging, especially on resource-constrained hardware such as mobile devices. Recent efforts try to reduce the computational complexity of ViT through network architecture search or hybrid design with MobileNet blocks, yet the inference speed is still unsatisfactory. This leads to an important question: can transformers run as fast as MobileNet while obtaining high performance?
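The abstract attributes much of the ViT latency gap to the attention mechanism. As a minimal illustration (not from the paper, and with toy dimensions chosen here for the example), the numpy sketch below computes scaled dot-product self-attention over N tokens; the intermediate N x N score matrix is the source of the quadratic cost in the number of tokens.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over N tokens.

    The (N, N) score matrix built here grows quadratically with the
    token count, which is the cost the abstract alludes to.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])          # shape (N, N)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ v

rng = np.random.default_rng(0)
n_tokens, dim = 196, 64   # e.g. a 14x14 patch grid; widths are toy values
x = rng.standard_normal((n_tokens, dim))
w = [rng.standard_normal((dim, dim)) for _ in range(3)]
out = self_attention(x, *w)
print(out.shape)          # (196, 64); the score matrix was (196, 196)
```

A depthwise convolution over the same grid would instead touch only a fixed-size neighborhood per token, which is one reason lightweight CNNs stay fast as resolution grows.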


Appendix A Latency Driven Slimming Algorithm

Neural Information Processing Systems

We provide the details of the proposed latency-driven fast slimming in Alg. 1. Our major conclusions and speed analysis can be found in Sec. 3 and Figure 1. Compared to non-overlapping large-kernel patch embedding (V5 in Tab. 3), MHSA with the global receptive field is an essential contribution to model performance. By comparing V1 and V2 in Tab. 3, we can observe the effect of GN. We explore ReLU and HardSwish (V3 and V4 in Tab. 3) in addition to GeLU. We conclude that the activation function can be selected on a case-by-case basis depending on the specific hardware and compiler. In this work, we use GeLU, which provides better performance than ReLU while still executing fast.
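The general shape of a latency-driven slimming pass can be sketched as a greedy loop: while the network exceeds a latency budget, drop the block with the worst importance-per-millisecond ratio. This is a simplified stand-in for Alg. 1, not the paper's actual procedure; the block names, latencies, and importance scores below are hypothetical toy values.

```python
# Hypothetical greedy latency-driven slimming sketch: repeatedly drop
# the block contributing the least importance per millisecond until
# the network fits a latency budget. All numbers are toy stand-ins.
def slim_to_budget(blocks, budget_ms):
    """blocks: list of (name, latency_ms, importance) tuples."""
    kept = list(blocks)
    total = sum(lat for _, lat, _ in kept)
    while total > budget_ms and len(kept) > 1:
        # candidate with the worst accuracy-per-latency trade-off
        worst = min(kept, key=lambda b: b[2] / b[1])
        kept.remove(worst)
        total -= worst[1]
    return kept, total

blocks = [("stem", 0.4, 5.0), ("mb4d_1", 0.6, 1.2),
          ("mb4d_2", 0.6, 0.9), ("mhsa_1", 1.1, 3.0),
          ("mhsa_2", 1.1, 0.8)]
kept, total = slim_to_budget(blocks, budget_ms=2.5)
print([name for name, _, _ in kept], total)
```

In a real setting, the latencies would come from on-device profiling and the importance scores from an accuracy proxy, which is what makes the slimming "latency-driven" rather than FLOP-driven.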


EfficientFormer: Vision Transformers at MobileNet Speed (Yanyu Li)

Neural Information Processing Systems

Then we introduce a dimension-consistent pure transformer (without MobileNet blocks) as a design paradigm. Finally, we perform latency-driven slimming to get a series of final models dubbed EfficientFormer.


Spatial Gated Multi-Layer Perceptron for Land Use and Land Cover Mapping

Jamali, Ali, Roy, Swalpa Kumar, Hong, Danfeng, Atkinson, Peter M, Ghamisi, Pedram

arXiv.org Artificial Intelligence

Convolutional Neural Networks (CNNs) are models used extensively for the hierarchical extraction of features. Vision transformers (ViTs), through the use of a self-attention mechanism, have recently achieved superior modeling of global contextual information compared to CNNs. However, to realize their image classification strength, ViTs require substantial training datasets. Where the available training data are limited, current advanced multi-layer perceptrons (MLPs) can provide viable alternatives to both deep CNNs and ViTs. In this paper, we developed SGU-MLP, a learning algorithm that effectively uses both MLPs and spatial gating units (SGUs) for precise land use and land cover (LULC) mapping. Results illustrated the superiority of the developed SGU-MLP classification algorithm over several CNN- and CNN-ViT-based models, including HybridSN, ResNet, iFormer, EfficientFormer and CoAtNet. The proposed SGU-MLP algorithm was tested through three experiments in Houston, USA, Berlin, Germany and Augsburg, Germany. The SGU-MLP classification model was found to consistently outperform the benchmark CNN- and CNN-ViT-based algorithms. For example, for the Houston experiment, SGU-MLP significantly outperformed HybridSN, CoAtNet, EfficientFormer, iFormer and ResNet by approximately 15%, 19%, 20%, 21% and 25%, respectively, in terms of average accuracy. The code will be made publicly available at https://github.com/aj1365/SGUMLP
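The spatial gating unit mentioned in the abstract follows the gMLP family of designs: split the channels in half, mix one half across the spatial (token) axis with a learned projection, and use it to gate the other half elementwise. The numpy sketch below illustrates that pattern under assumed shapes; the weights, the near-identity initialization, and all dimensions are illustrative, not taken from the SGU-MLP paper.

```python
import numpy as np

def spatial_gating_unit(x, w_spatial, b_spatial):
    """gMLP-style spatial gating (illustrative sketch).

    x: (n_tokens, channels) with an even channel count.
    The second channel half is mixed across tokens, then gates the
    first half elementwise.
    """
    u, v = np.split(x, 2, axis=-1)
    v = w_spatial @ v + b_spatial    # projection along the token axis
    return u * v                      # elementwise gate

rng = np.random.default_rng(1)
n, c = 49, 32                         # e.g. a 7x7 patch grid, toy width
x = rng.standard_normal((n, c))
# near-identity spatial weights and a bias near 1 keep the gate
# close to a pass-through at initialization, as in gMLP
w = np.eye(n) + 0.01 * rng.standard_normal((n, n))
b = np.ones((n, 1))
out = spatial_gating_unit(x, w, b)
print(out.shape)                      # (49, 16)
```

Because the spatial projection is a dense map over tokens rather than pairwise attention, the gating cost is a single (n, n) matrix multiply, which is part of why MLP-style models can be attractive when training data are limited.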


Asif Razzaq on LinkedIn: #tech #ai #artificialintelligence

#artificialintelligence

Snap and Northeastern University Researchers Propose EfficientFormer: A Vision Transformer That Runs As Fast As MobileNet While Maintaining High Performance. In natural language processing, the Transformer is a unique design that seeks to solve sequence-to-sequence tasks while also resolving long-range dependencies. Vision Transformers (ViT) have demonstrated excellent results on computer vision benchmarks in recent years. On the other hand, they are usually many times slower than lightweight convolutional networks because of their large number of parameters and model architecture choices such as the attention mechanism. As a result, deploying ViT for real-time applications is difficult, especially on hardware with limited resources, such as mobile devices. Snap Inc. and Northeastern University collaborated on a new study that answers this fundamental question and suggests a new ViT paradigm.