Inception Transformer
Neural Information Processing Systems
Recent studies show that the Transformer has a strong capability of building long-range dependencies, yet it is incompetent in capturing high frequencies that predominantly convey local information. To tackle this issue, we present a novel and general-purpose Inception Transformer, or iFormer for short, that effectively learns comprehensive features with both high- and low-frequency information in visual data. Unlike recent hybrid frameworks, the Inception mixer gains efficiency through a channel splitting mechanism that adopts parallel convolution/max-pooling paths and a self-attention path as high- and low-frequency mixers, while retaining the flexibility to model discriminative information scattered across a wide frequency range. Considering that bottom layers play a larger role in capturing high-frequency details while top layers contribute more to modeling low-frequency global information, we further introduce a frequency ramp structure, i.e., gradually decreasing the dimensions fed to the high-frequency mixer and increasing those fed to the low-frequency mixer, which effectively trades off high- and low-frequency components across different layers. We benchmark the iFormer on a series of vision tasks and show that it achieves impressive performance on image classification, COCO detection, and ADE20K segmentation.
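The channel-splitting idea in the abstract can be sketched as follows. This is a minimal, unparameterized NumPy illustration, not the paper's implementation: a 3-token max-pool window stands in for the learned convolution/max-pooling high-frequency branches, and a plain dot-product self-attention (with no learned projections) stands in for the low-frequency branch; the function name `inception_mixer` and the `ratio_high` split parameter are assumptions for this sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def inception_mixer(x, ratio_high=0.5):
    """Toy sketch of the Inception mixer's channel split.

    x: (N, C) array of N token features with C channels.
    Channels are split into a high-frequency group, mixed locally
    (here: a max-pool over a 3-token window), and a low-frequency
    group, mixed globally via self-attention. The real iFormer uses
    learned conv/pool/attention branches; this only shows the split.
    """
    N, C = x.shape
    ch = int(C * ratio_high)               # channels for the high-freq path
    x_high, x_low = x[:, :ch], x[:, ch:]

    # High-frequency path: local max-pooling over a 3-token window
    pad = np.pad(x_high, ((1, 1), (0, 0)), mode="edge")
    high = np.stack([pad[i:i + 3].max(axis=0) for i in range(N)])

    # Low-frequency path: global (unparameterized) self-attention
    scale = np.sqrt(x_low.shape[1])
    attn = softmax(x_low @ x_low.T / scale, axis=-1)
    low = attn @ x_low

    # Fuse the two frequency branches back along the channel axis
    return np.concatenate([high, low], axis=1)
```

The frequency ramp structure described above corresponds to lowering `ratio_high` in deeper layers, so later blocks devote more channels to the global attention branch.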
Jan-17-2025, 23:01:15 GMT
- Genre:
- Research Report (0.60)
- Technology:
- Information Technology > Artificial Intelligence > Vision (0.60)