Orthogonal Transformer: An Efficient Vision Transformer Backbone with Token Orthogonalization Huaibo Huang, Ran He
–Neural Information Processing Systems
We present a general vision transformer backbone, called Orthogonal Transformer, in pursuit of both efficiency and effectiveness. A major challenge for vision transformers is that self-attention, the key element for capturing long-range dependencies, is computationally expensive for dense prediction tasks (e.g., object detection). Coarse global self-attention and local self-attention have been designed to reduce this cost, but they suffer from either neglecting local correlations or hurting global modeling. We present an orthogonal self-attention mechanism to alleviate these issues. Specifically, self-attention is computed in an orthogonal space that is reversibly mapped from the spatial domain but has a much lower resolution.
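The following is a minimal sketch of the idea described above, not the authors' implementation: tokens are mapped into a lower-resolution space with a (near-)orthogonal matrix, standard self-attention runs there, and the result is mapped back with the transpose, which inverts an orthogonal transform. The class name, shapes, and the QR-based initialization are assumptions chosen for illustration.

```python
# Hedged sketch of orthogonal self-attention (illustrative, not the paper's code).
import torch
import torch.nn as nn


class OrthogonalSelfAttentionSketch(nn.Module):
    def __init__(self, dim, num_heads=4, num_tokens=196, reduced_tokens=49):
        super().__init__()
        # Assumed parametrization: an N x M projection (M << N) initialized via QR,
        # so its columns are orthonormal and P^T P = I_M at initialization.
        init = torch.linalg.qr(torch.randn(num_tokens, reduced_tokens)).Q
        self.proj = nn.Parameter(init)  # (N, M)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):  # x: (B, N, C) spatial-domain tokens
        # Map to the low-resolution orthogonal space: x_low = P^T x, shape (B, M, C).
        x_low = torch.einsum('nm,bnc->bmc', self.proj, x)
        # Self-attention over M << N tokens, which is the source of the cost reduction.
        y_low, _ = self.attn(x_low, x_low, x_low)
        # Map back to the spatial domain: y = P y_low, shape (B, N, C).
        y = torch.einsum('nm,bmc->bnc', self.proj, y_low)
        return x + y  # residual connection in the spatial domain


# Toy usage: 196 tokens (a 14x14 feature map) attended over 49 orthogonal tokens.
tokens = torch.randn(2, 196, 64)
out = OrthogonalSelfAttentionSketch(dim=64)(tokens)
print(out.shape)  # torch.Size([2, 196, 64])
```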