Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding

Zhang, Pengchuan, Dai, Xiyang, Yang, Jianwei, Xiao, Bin, Yuan, Lu, Zhang, Lei, Gao, Jianfeng

Mar-29-2021–arXiv.org Artificial Intelligence

This paper presents a new Vision Transformer (ViT) architecture Multi-Scale Vision Longformer, which significantly enhances the ViT of \cite{dosovitskiy2020image} for encoding high-resolution images using two techniques. The first is the multi-scale model structure, which provides image encodings at multiple scales with manageable computational cost. The second is the attention mechanism of vision Longformer, which is a variant of Longformer \cite{beltagy2020longformer}, originally developed for natural language processing, and achieves a linear complexity w.r.t. the number of input tokens. A comprehensive empirical study shows that the new ViT significantly outperforms several strong baselines, including the existing ViT models and their ResNet counterparts, and the Pyramid Vision Transformer from a concurrent work \cite{wang2021pyramid}, on a range of vision tasks, including image classification, object detection, and segmentation. The models and source code used in this study will be released to public soon.

attention mechanism, mechanism, vision longformer, (15 more...)

arXiv.org Artificial Intelligence

Mar-29-2021

arXiv.org PDF

Add feedback

Country:
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)

Genre:
- Research Report > New Finding (0.66)

Technology:
- Information Technology
  - Sensing and Signal Processing > Image Processing (1.00)
  - Artificial Intelligence
    - Vision (1.00)
    - Natural Language (1.00)
    - Machine Learning > Neural Networks (0.93)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found