ViT -- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale -- ICLR'21
After the success of transformers in NLP, researchers started applying them to vision as well, where CNN-based variants still dominate high-level tasks such as object detection, segmentation, and classification. Google Brain's research team stepped in again and published the Vision Transformer (ViT) paper, which this post summarizes. ViTs did not give satisfactory results when trained on smaller datasets, but they outperformed the state of the art (SOTA) in image classification by a small margin when pre-trained on large datasets and then fine-tuned on smaller ones. Pre-trained ViTs outperformed EfficientNet- and ResNet-based SOTA networks on datasets including ImageNet, ImageNet ReaL, CIFAR-100, and the 19-task VTAB suite.
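The "16x16 words" in the title refers to cutting an image into fixed-size patches that the transformer treats like tokens in a sentence. The snippet below is a minimal sketch of that splitting step only, not the paper's implementation: the function name `split_into_patches`, the single-channel input, and the 224 x 224 dummy image are all assumptions made for illustration.

```python
def split_into_patches(image, patch_size=16):
    """Split an H x W image (a list of rows) into flattened square patches.

    Hypothetical helper for illustration; patches are returned in
    row-major order (left to right, then top to bottom).
    """
    height = len(image)
    width = len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be multiples of patch_size")

    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch_size x patch_size block into a single list.
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# Assumed input: a 224 x 224 single-channel image filled with dummy values.
image = [[(r * 224 + c) % 256 for c in range(224)] for r in range(224)]
patches = split_into_patches(image)

# Number of patches ("words") and number of values in each patch.
print(len(patches), len(patches[0]))
```

With these assumed sizes the image yields a 14 x 14 grid of patches. An RGB image would give three times as many values per patch, and in a ViT each flattened patch is then mapped to an embedding vector before entering the transformer.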
Jan-30-2022, 22:00:21 GMT