Revisiting the Integration of Convolution and Attention for Vision Backbone
Neural Information Processing Systems
Convolutions (Convs) and multi-head self-attentions (MHSAs) are typically considered alternatives to each other for building vision backbones. Although some works try to integrate both, they apply the two operators simultaneously at the finest pixel granularity.
Feb-12-2026, 15:51:09 GMT