Focal Modulation Networks

May-1-2026, 01:51:05 GMT–Neural Information Processing Systems

We propose focal modulation networks (FocalNets in short), where self-attention (SA) is completely replaced by a focal modulation module for modeling token interactions in vision. Focal modulation comprises three components: (i)hierarchical contextualization, implemented using a stack of depth-wise convolutional layers, to encode visual contexts from short to long ranges, (ii) gated aggregation to selectively gather contexts for each query token based on its content, and (iii) element-wise modulation or affine transformation to fuse the aggregated context into the query. Extensive experiments show FocalNets outperform the state-of-the-art SA counterparts (e.g., Swin and Focal Transformers) with similar computational cost on the tasks of image classification, object detection, and semantic segmentation. Specifically, FocalNets with tiny and base size achieve 82.3% and 83.9% top-1 accuracy on ImageNet-1K.

arxiv preprint arxiv, machine learning, natural language, (17 more...)

Neural Information Processing Systems

May-1-2026, 01:51:05 GMT

Conferences PDF

Add feedback

Genre:
- Research Report (0.46)

Technology:
- Information Technology
  - Sensing and Signal Processing > Image Processing (1.00)
  - Artificial Intelligence
    - Vision (1.00)
    - Natural Language (0.93)
    - Machine Learning > Neural Networks
      - Deep Learning (0.68)

Duplicate Docs Excel Report

Title
FocalModulationNetworks

Similar Docs Excel Report more

Title	Similarity	Source
None found