XCiT: Cross-Covariance Image Transformers

Neural Information Processing Systems 

Following their success in natural language processing, transformers have recently shown much promise for computer vision. The self-attention operation underlying transformers yields global interactions between all tokens,i.e.