iFormer: Integrating ConvNet and Transformer for Mobile Application
arXiv.org Artificial Intelligence
We present a new family of mobile hybrid vision networks, called iFormer, focused on optimizing latency and accuracy for mobile applications. The local interactions are derived by transforming a standard convolutional network, i.e., ConvNeXt, into a more lightweight mobile network. Our newly introduced mobile modulation attention removes the memory-intensive operations of MHA and employs an efficient modulation mechanism to boost dynamic global representational capacity. We conduct comprehensive experiments demonstrating that iFormer outperforms existing lightweight networks across various tasks. Notably, iFormer achieves an impressive Top-1 accuracy of 80.4% on ImageNet-1k with a latency of only 1.10 ms on an iPhone 13, surpassing the recently proposed MobileNetV4 under similar latency constraints. Additionally, our method shows significant improvements in downstream tasks, including COCO object detection, instance segmentation, and ADE20k semantic segmentation, while still maintaining low latency on mobile devices for the high-resolution inputs these scenarios require.

Building lightweight neural networks facilitates real-time analysis of images and videos captured by mobile devices such as smartphones. Processing data locally on the device not only enhances privacy protection and security but also improves the overall user experience.

[Figure: Comparison between our iFormer and other existing methods on ImageNet-1k; latency is measured on an iPhone 13. Our iFormer is Pareto-optimal.]

The core mechanism underlying ViTs is self-attention, which dynamically learns interactions between tokens. This enables the model to focus on important regions adaptively. Nevertheless, deploying ViTs on mobile devices with limited resources poses significant challenges.
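To make the contrast with standard MHA concrete, the following is a minimal NumPy sketch of a generic modulation-style attention block: instead of materializing an N x N attention matrix, a value branch is modulated element-wise by a globally pooled context vector. All function and weight names here are hypothetical illustrations, not the paper's actual iFormer design.

```python
import numpy as np

def modulation_attention(x, w_v, w_ctx, w_out):
    """Hypothetical modulation-style token mixer.

    x: (N, C) array of N tokens with C channels.
    w_v, w_ctx, w_out: (C, C) projection weights (illustrative only).
    """
    v = x @ w_v                             # value branch, (N, C)
    ctx = np.tanh(x.mean(axis=0) @ w_ctx)   # global context vector, (C,)
    y = v * ctx                             # element-wise modulation; no (N, N) attention matrix
    return y @ w_out                        # output projection, (N, C)
```

Because the context is a single pooled vector, the cost is linear in the number of tokens, which is the kind of memory saving over quadratic self-attention that the abstract alludes to.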
Feb-17-2025