Global Context Vision Transformers -- Nvidia's new SOTA Image Model
Nvidia recently published a new vision transformer called the Global Context Vision Transformer (GC ViT) (Hatamizadeh et al., 2022). GC ViT introduces an architecture that combines global and local attention, allowing it to model both long-range and short-range spatial interactions. The techniques used by the Nvidia researchers let GC ViT model global attention while avoiding expensive computation. GC ViT achieves state-of-the-art (SOTA) results on the ImageNet-1K dataset, surpassing the Swin Transformer by a significant margin. In this article, we take a closer look at the inner workings of GC ViT and the techniques that enabled these results.
Sep-6-2022, 19:00:23 GMT