GPViT: A High Resolution Non-Hierarchical Vision Transformer with Group Propagation
Chenhongyi Yang, Jiarui Xu, Shalini De Mello, Elliot J. Crowley, Xiaolong Wang
arXiv.org Artificial Intelligence
We present the Group Propagation Vision Transformer (GPViT): a novel non-hierarchical (i.e. non-pyramidal) transformer model designed for visual recognition with high-resolution features. High-resolution features (or tokens) are a natural fit for tasks that involve perceiving fine-grained details, such as detection and segmentation, but exchanging global information between these features is expensive in memory and computation because of the way self-attention scales. We provide a highly efficient alternative, the Group Propagation Block (GP Block), to exchange global information. In each GP Block, features are first grouped together by a fixed number of learnable group tokens; we then perform Group Propagation, in which global information is exchanged between the grouped features; finally, global information in the updated grouped features is returned to the image features through a transformer decoder. We evaluate GPViT on a variety of visual recognition tasks including image classification, semantic segmentation, object detection, and instance segmentation. Our method achieves significant performance gains over previous works across all tasks, especially on tasks that require high-resolution outputs; for example, our GPViT-L3 outperforms Swin Transformer-B by 2.0 mIoU on ADE20K semantic segmentation with only half as many parameters.

In the original ViT architecture, image patches are passed through transformer encoder layers, each containing self-attention and MLP blocks, and the spatial resolution of the image patches is constant throughout the network. GPViT keeps this resolution high while using GP Blocks to make global information exchange cheap; this design reduces computational cost and improves ViT performance on tasks such as detection and segmentation.
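The three steps of a GP Block described above (grouping, propagation, ungrouping) can be sketched in NumPy. This is a minimal illustration, not the paper's implementation: the function name `gp_block` and the weight matrix `w_prop` are hypothetical, attention is single-head without learned projections, and the paper's MLPMixer-style propagation step is replaced here by a single linear update.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gp_block(x, group_tokens, w_prop):
    """Simplified Group Propagation Block.

    x:            (n, d) high-resolution image features
    group_tokens: (m, d) learnable group tokens, with m << n
    w_prop:       (d, d) weights standing in for the propagation MLP
    """
    d = x.shape[1]
    # 1. Grouping: group tokens cross-attend to the n image features,
    #    producing m grouped features that summarize the image.
    attn = softmax(group_tokens @ x.T / np.sqrt(d))   # (m, n)
    grouped = attn @ x                                 # (m, d)

    # 2. Group Propagation: exchange global information among the m
    #    grouped features (cheap, since m is small and fixed).
    grouped = grouped + grouped @ w_prop               # (m, d)

    # 3. Ungrouping: image features cross-attend back to the updated
    #    grouped features (transformer-decoder style), returning
    #    global information to every high-resolution token.
    attn_back = softmax(x @ grouped.T / np.sqrt(d))    # (n, m)
    return x + attn_back @ grouped                     # (n, d)

rng = np.random.default_rng(0)
n, m, d = 196, 16, 32   # n image tokens, m group tokens, d channels
x = rng.standard_normal((n, d))
out = gp_block(x, rng.standard_normal((m, d)),
               0.01 * rng.standard_normal((d, d)))
print(out.shape)  # (196, 32)
```

Because each cross-attention is between n image tokens and only m group tokens, the cost of one global exchange scales as O(nm) rather than the O(n^2) of full self-attention over high-resolution features.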
Apr-25-2023