Swin Transformer: Hierarchical Vision Transformer using Shifted Window -- Part I
So Facebook AI's team came up with DeiT, a data-efficient transformer that was able to outperform SOTA convolutional networks and ViT in terms of the accuracy/FLOPs trade-off. DeiT was trained on no external data, just ImageNet-1K. But it used knowledge distillation and depended on a convolutional network as the teacher, so it was not a completely convolution-free solution. Both DeiT and ViT were designed and tested only for image classification, with the general perception that if a network architecture performs well on image classification, it is expected to do well on other tasks too, because image classification is used as a benchmark for measuring the progress of a technique in the vision domain, and any progress there translates to downstream tasks like detection and segmentation. To my knowledge, there is no other work that used ViT or DeiT as a feature-extraction backbone for tasks other than classification.
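To make the distillation dependence concrete, here is a minimal sketch of the soft-distillation objective in that family of methods (temperature-scaled KL divergence between teacher and student outputs, in the style of Hinton et al.). This is an illustrative NumPy version, not DeiT's actual implementation; DeiT additionally uses a dedicated distillation token and a hard-label variant, which are omitted here, and the function names are my own.

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature-scaled softmax over the last axis (numerically stable).
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def soft_distillation_loss(student_logits, teacher_logits, T=3.0):
    # KL(teacher || student) on temperature-softened distributions,
    # scaled by T^2 so gradients stay comparable across temperatures.
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    kl = np.sum(p_teacher * (np.log(p_teacher + 1e-12)
                             - np.log(p_student + 1e-12)), axis=-1)
    return (T ** 2) * kl.mean()

# Toy batch of 2 examples, 3 classes: the teacher here stands in for
# the convolutional network DeiT distills from.
teacher = np.array([[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]])
student = np.array([[1.8, 0.4, -0.9], [0.0, 1.2, 0.5]])
loss = soft_distillation_loss(student, teacher)
```

The loss is zero only when the student matches the teacher's softened distribution exactly, which is what ties DeiT's training signal to the convolutional teacher.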
Feb-13-2022, 07:40:15 GMT