Orthogonal Transformer: An Efficient Vision Transformer Backbone with Token Orthogonalization

Huaibo Huang, Ran He

May-30-2025, 06:52:40 GMT–Neural Information Processing Systems

We present a general vision transformer backbone, called Orthogonal Transformer, in pursuit of both efficiency and effectiveness. A major challenge for vision transformers is that self-attention, the key element in capturing long-range dependencies, is computationally very expensive for dense prediction tasks (e.g., object detection). Coarse global self-attention and local self-attention have been designed to reduce this cost, but the former neglects local correlations and the latter weakens global modeling. We present an orthogonal self-attention mechanism to alleviate these issues. Specifically, self-attention is computed in an orthogonal space that can be mapped back to the spatial domain without loss but has much lower resolution.
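The idea of attending in a reversible, lower-resolution space can be illustrated with a minimal NumPy sketch. This is one plausible reading of the abstract, not the authors' implementation. It assumes a 1-D token sequence, a random orthogonal matrix obtained by QR decomposition, and attention without learned projections. The function names (`orthogonalize`, `deorthogonalize`, `group_attention`) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def random_orthogonal(p, rng):
    # Assumption: any orthogonal matrix will do for the sketch; here we take
    # the Q factor of a Gaussian matrix.
    q, _ = np.linalg.qr(rng.standard_normal((p, p)))
    return q

def orthogonalize(x, q):
    # x: (n, d) tokens, q: (p, p) orthogonal, n divisible by p.
    n, d = x.shape
    p = q.shape[0]
    windows = x.reshape(n // p, p, d)                # n/p windows of p tokens
    mixed = np.einsum("ij,wjd->wid", q, windows)     # apply q inside each window
    return mixed.transpose(1, 0, 2)                  # p groups of n/p tokens

def deorthogonalize(groups, q):
    # Inverse of orthogonalize: q is orthogonal, so its inverse is q.T.
    p, m, d = groups.shape
    mixed = groups.transpose(1, 0, 2)                # (m, p, d)
    windows = np.einsum("ji,wjd->wid", q, mixed)     # apply q.T inside each window
    return windows.reshape(m * p, d)

def group_attention(groups):
    # Plain softmax attention within each group (no learned Q/K/V projections).
    d = groups.shape[-1]
    scores = groups @ groups.transpose(0, 2, 1) / np.sqrt(d)   # (p, m, m)
    return softmax(scores) @ groups

def orthogonal_self_attention(x, q):
    return deorthogonalize(group_attention(orthogonalize(x, q)), q)

rng = np.random.default_rng(0)
tokens = rng.standard_normal((16, 8))    # 16 tokens, 8 channels
q = random_orthogonal(4, rng)            # window size p = 4
out = orthogonal_self_attention(tokens, q)
```

With n = 16 tokens and window size p = 4, each of the 4 groups holds 4 tokens, so the sketch computes 4 × 4² = 64 attention scores instead of the 16² = 256 needed for full self-attention. Because the mixing matrix is orthogonal, the transform back to the original token layout is exact up to floating-point error.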

artificial intelligence, machine learning, transformer, (17 more...)

Country:
- Asia > China (0.29)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning > Neural Networks
    - Deep Learning (0.46)
  - Vision (1.00)