ViT -- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale -- ICLR'21

Jan-30-2022, 22:00:21 GMT–#artificialintelligence

After the success of transformers in NLP, researchers began applying them to vision, where CNN-based architectures still dominate high-level tasks such as object detection, segmentation, and classification. Google Brain's research team contributed a paper introducing the Vision Transformer (ViT), which this post summarizes. ViT did not give satisfactory results when trained on smaller datasets, but it outperformed the state of the art (SOTA) in image classification, by a small margin, when trained on large datasets. Specifically, ViT performed well when pre-trained on a large dataset and then fine-tuned on a smaller one. Pre-trained ViTs outperformed EfficientNet- and ResNet-based SOTA networks on benchmarks including ImageNet, ImageNet-ReaL, CIFAR-100, and VTAB (a suite of 19 tasks).
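The "16x16 words" in the title refers to cutting an image into fixed-size patches that play the role of tokens. The following is a minimal sketch of that arithmetic only, not the authors' implementation: the function name `split_into_patches` and the 224 x 224 RGB input size are assumptions made for illustration, and the image is a plain nested list so the example needs no libraries.

```python
def split_into_patches(image, patch_size=16):
    """Split an H x W x C image (nested lists) into flattened,
    non-overlapping patch_size x patch_size patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    patches = []
    # Walk the image in row-major order, one patch at a time.
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                value
                for row in image[top:top + patch_size]
                for pixel in row[left:left + patch_size]
                for value in pixel
            ]
            patches.append(patch)
    return patches


# Hypothetical input: a 224 x 224 RGB image filled with zeros.
image = [[[0, 0, 0] for _ in range(224)] for _ in range(224)]
patches = split_into_patches(image)
print(len(patches), len(patches[0]))  # 196 768
```

Under these assumptions the image becomes a sequence of 196 "words", each a vector of 16 * 16 * 3 = 768 values. A real model would then map each flattened patch to an embedding before feeding the sequence to a transformer; that step is omitted here.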



