Transformers in Vision: A Survey

Khan, Salman; Naseer, Muzammal; Hayat, Munawar; Zamir, Syed Waqas; Khan, Fahad Shahbaz; Shah, Mubarak

Jan-4-2021–arXiv.org Artificial Intelligence

The astounding results of transformer models on natural language tasks have prompted the vision community to study their application to computer vision problems. This has led to exciting progress on a number of tasks while requiring minimal inductive biases in the model design. This survey aims to provide a comprehensive overview of transformer models in computer vision and assumes little to no prior background in the field. We start with an introduction to the fundamental concepts behind the success of transformer models, i.e., self-supervision and self-attention. Transformer architectures leverage self-attention mechanisms to encode long-range dependencies in the input domain, which makes them highly expressive. Since they assume minimal prior knowledge about the structure of the problem, self-supervision using pretext tasks is applied to pre-train transformer models on large-scale (unlabelled) datasets. The learned representations are then fine-tuned on downstream tasks, typically leading to excellent performance due to the generalization and expressivity of the encoded features. We cover extensive applications of transformers in vision, including popular recognition tasks (e.g., image classification, object detection, action recognition, and segmentation), generative modeling, multi-modal tasks (e.g., visual question answering and visual reasoning), video processing (e.g., activity recognition and video forecasting), low-level vision (e.g., image super-resolution and colorization), and 3D analysis (e.g., point cloud classification and segmentation). We compare the respective advantages and limitations of popular techniques, both in terms of architectural design and their experimental value. Finally, we provide an analysis of open research directions and possible future work.
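To make the self-attention idea mentioned in the abstract concrete, the following is a minimal sketch, assuming single-head scaled dot-product attention, which is the common formulation; the abstract itself does not specify a variant. The function names (`softmax`, `matmul`, `self_attention`) and the toy projection matrices are hypothetical and chosen only for illustration. It shows how every token's output is a weighted mix of all tokens, which is what lets the model encode long-range dependencies.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(a):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention (illustrative sketch).

    x: n tokens, each a d-dimensional feature vector (n x d).
    w_q, w_k, w_v: query, key, and value projection matrices.
    Returns the attended features and the n x n attention weights.
    """
    q = matmul(x, w_q)
    k = matmul(x, w_k)
    v = matmul(x, w_v)
    d_k = len(k[0])
    # Similarity of every token with every other token.
    scores = matmul(q, transpose(k))
    # Scale by sqrt(d_k), then normalize each row into a distribution.
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    # Each output token is a weighted sum of all value vectors.
    return matmul(weights, v), weights


# Toy input: three "tokens" (e.g., image patches) with two features each.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # assumed projections, for illustration
attended, attn = self_attention(tokens, identity, identity, identity)
```

Each row of `attn` sums to 1 and has a nonzero entry for every token, so a token can draw on any other token regardless of their distance in the input. In a trained model the three projection matrices are learned rather than fixed.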

Country:
- Oceania > Australia
  - Australian Capital Territory > Canberra (0.04)
- North America > United States
  - Massachusetts (0.04)
  - Illinois > Cook County
    - Chicago (0.04)
  - Florida > Orange County
    - Orlando (0.14)
- Europe > Sweden
  - Östergötland County > Linköping (0.04)
- Asia > Middle East
  - UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)

Genre:
- Research Report (1.00)
- Overview (1.00)

Industry:
- Energy (0.45)
- Health & Medicine > Therapeutic Area (0.34)

Technology:
- Information Technology > Artificial Intelligence
  - Vision (1.00)
  - Natural Language > Large Language Model (1.00)
  - Machine Learning
    - Neural Networks > Deep Learning (1.00)
    - Statistical Learning (0.93)
