Focal Modulation Networks
Neural Information Processing Systems
Focal modulation comprises three components: (i) hierarchical contextualization, implemented using a stack of depth-wise convolutional layers, to encode visual contexts from short to long ranges, (ii) gated aggregation to selectively gather contexts for each query token based on its content, and (iii) element-wise modulation or affine transformation to fuse the aggregated context into the query. Extensive experiments show that FocalNets outperform state-of-the-art self-attention (SA) counterparts (e.g., Swin and Focal Transformers) at similar computational cost on image classification, object detection, and semantic segmentation.
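The three components can be made concrete with a small NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the helper names (`depthwise_conv2d`, `init_params`, `focal_modulation`), the bias-free linear projections, the GELU activation, the extra globally pooled context level, and the kernel sizes are all choices made here for clarity, and the final output projection a full layer would likely have is omitted.

```python
import numpy as np


def depthwise_conv2d(x, kernels):
    """'Same'-padded depth-wise convolution: one k x k filter per channel.

    x: (H, W, C) feature map; kernels: (k, k, C) with odd k.
    """
    k = kernels.shape[0]
    pad = k // 2
    H, W, _ = x.shape
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    out = np.zeros_like(x)
    for dy in range(k):
        for dx in range(k):
            # Each channel is filtered independently (no channel mixing).
            out += xp[dy:dy + H, dx:dx + W, :] * kernels[dy, dx, :]
    return out


def gelu(x):
    # tanh approximation of GELU (assumed activation)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))


def init_params(channels, kernel_sizes, rng):
    """Random, bias-free weights for the sketch (hypothetical layout)."""
    levels = len(kernel_sizes)
    scale = 1.0 / np.sqrt(channels)
    return {
        "w_q": rng.standard_normal((channels, channels)) * scale,    # query projection
        "w_z": rng.standard_normal((channels, channels)) * scale,    # context projection
        "w_g": rng.standard_normal((channels, levels + 1)) * scale,  # one gate per level
        "w_h": rng.standard_normal((channels, channels)) * scale,    # modulator projection
        "dw": [rng.standard_normal((k, k, channels)) / (k * k) for k in kernel_sizes],
    }


def focal_modulation(x, params):
    """x: (H, W, C) token map -> (H, W, C) modulated tokens."""
    levels = len(params["dw"])
    q = x @ params["w_q"]
    z = x @ params["w_z"]
    gates = x @ params["w_g"]  # (H, W, levels + 1), computed from each token's content

    ctx = np.zeros_like(z)
    for l in range(levels):
        # (i) hierarchical contextualization: stacked depth-wise convs
        #     widen the receptive field from short to long range.
        z = gelu(depthwise_conv2d(z, params["dw"][l]))
        # (ii) gated aggregation: each token weights this level's context.
        ctx += gates[..., l:l + 1] * z

    # Assumed extra level: a globally average-pooled context.
    z_global = gelu(z.mean(axis=(0, 1), keepdims=True))
    ctx += gates[..., levels:levels + 1] * z_global

    # (iii) element-wise modulation: fuse the aggregated context into the query.
    modulator = ctx @ params["w_h"]
    return q * modulator


rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 4))
params = init_params(channels=4, kernel_sizes=[3, 5], rng=rng)
y = focal_modulation(x, params)
```

Unlike self-attention, nothing here computes pairwise token-to-token scores: context is gathered by convolutions whose cost grows linearly with the number of tokens, and the query only interacts with the already aggregated context through an element-wise product.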
Feb-7-2026, 18:05:13 GMT
- Technology: