XCiT: Cross-Covariance Image Transformers
Neural Information Processing Systems
Following their success in natural language processing, transformers have recently shown much promise for computer vision. The self-attention operation underlying transformers yields global interactions between all tokens, i.e. words or image patches, and enables flexible modelling of image data beyond the local interactions of convolutions. This flexibility, however, comes with a quadratic complexity in time and memory, hindering application to long sequences and high-resolution images. We propose a "transposed" version of self-attention that operates across feature channels rather than tokens, where the interactions are based on the cross-covariance matrix between keys and queries. The resulting cross-covariance attention (XCA) has linear complexity in the number of tokens, and allows efficient processing of high-resolution images. Our cross-covariance image transformer (XCiT) is built upon XCA.
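The "transposed" attention described above can be illustrated with a minimal single-head NumPy sketch. This is an assumption-laden illustration, not the authors' implementation: the function name `xca`, the explicit weight matrices, the epsilon in the normalisation, and the scalar temperature `tau` are all illustrative choices. The key point it demonstrates is that the attention map is a d x d matrix over feature channels, so the cost grows linearly with the number of tokens n rather than quadratically.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def xca(x, w_q, w_k, w_v, tau=1.0):
    """Single-head cross-covariance attention (XCA) sketch.

    x: (n, d) token embeddings; w_q, w_k, w_v: (d, d) projections;
    tau: temperature scaling (a learned parameter in the paper).
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # L2-normalise each feature column across tokens so the
    # cross-covariance entries are bounded.
    qn = q / (np.linalg.norm(q, axis=0, keepdims=True) + 1e-8)
    kn = k / (np.linalg.norm(k, axis=0, keepdims=True) + 1e-8)
    # Channel-to-channel attention: a (d, d) map, independent of n.
    attn = softmax(tau * (qn.T @ kn), axis=-1)
    # Mix the value channels with the attention map; output is (n, d).
    return (attn @ v.T).T

# Tiny usage example with random data.
rng = np.random.default_rng(0)
n, d = 16, 8
x = rng.standard_normal((n, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) for _ in range(3))
y = xca(x, w_q, w_k, w_v)
```

Note the contrast with standard self-attention, where the softmax is taken over an n x n token-to-token matrix; here the only matrix whose size depends on n is the output itself.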