Global Filter Networks for Image Classification

Neural Information Processing Systems

Recent advances in self-attention and pure multi-layer perceptron (MLP) models for vision have shown great potential in achieving promising performance with fewer inductive biases. These models generally learn interactions among spatial locations from raw data. The complexity of self-attention and MLP grows quadratically with image size, which makes these models hard to scale up when high-resolution features are required. In this paper, we present the Global Filter Network (GFNet), a conceptually simple yet computationally efficient architecture that learns long-term spatial dependencies in the frequency domain with log-linear complexity. Our architecture replaces the self-attention layer in vision transformers with three key operations: a 2D discrete Fourier transform, an element-wise multiplication between frequency-domain features and learnable global filters, and a 2D inverse Fourier transform.
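The three operations above can be sketched in a few lines. This is a minimal numpy illustration, not the authors' implementation: the function and variable names are hypothetical, the filter is a dense complex tensor matching the real-FFT output shape, and learning of the filter is omitted.

```python
import numpy as np

def global_filter(x, filt):
    """Sketch of a global filter layer: 2D FFT -> element-wise
    multiplication with a learnable filter -> 2D inverse FFT.

    x:    real feature map of shape (H, W, C)
    filt: complex filter of shape (H, W//2 + 1, C), matching the
          rfft2 output (illustrative names, not the paper's code)
    """
    # 2D DFT over the spatial dimensions; FFT gives log-linear cost
    X = np.fft.rfft2(x, axes=(0, 1))
    # element-wise multiplication in the frequency domain acts as a
    # global (full receptive field) filter in the spatial domain
    X = X * filt
    # transform back to the spatial domain
    return np.fft.irfft2(X, s=x.shape[:2], axes=(0, 1))

# an all-ones filter is the identity: the input is recovered
x = np.random.rand(8, 8, 3)
identity = np.ones((8, 8 // 2 + 1, 3), dtype=complex)
y = global_filter(x, identity)
```

By the convolution theorem, multiplying in the frequency domain corresponds to a circular convolution with a filter as large as the feature map, which is how the layer captures long-term spatial dependencies at O(HW log HW) cost.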




Using Bounding Boxes

Neural Information Processing Systems

We agree that it is a nice idea to exploit the bounding boxes of ImageNet, and we are happy to explore it in GFNet. We will add more comparisons on this point in our revision. The iterative process is indeed not indispensable, but in our experiments it improves the accuracy (e.g., …). We will make these points clear in the revision. We will release all the code and pre-trained models upon acceptance of this paper. Table 1: Results using 32x32 (left) and 64x64 (right) patches.