The power of Convolution in Vision Transformer

#artificialintelligence 

It is well known today that Transformers are used not only for natural language processing but also play a vital role in computer vision applications, in the form of Vision Transformers (ViT). In fact, their power has been demonstrated time and time again, as seen in their SOTA performance. However, one major drawback of vision transformers is their reliance on huge amounts of data. Another major drawback is their below-average optimizability: vision transformers have been shown to be very sensitive to the type of optimizer used (Adam vs. AdamW vs. SGD, etc.), the choice of learning hyperparameters, the depth of the network, the length of the training schedule, and so on. Researchers have indicated that this drawback results from the "patchify stem", the early visual-processing layer, which is implemented as a convolution with large kernel and stride sizes (16 by default).
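To make the "patchify stem" concrete, here is a minimal NumPy sketch (the function name and shapes are illustrative, not from any specific library): splitting an image into non-overlapping 16x16 patches and flattening each one is mathematically equivalent to a convolution whose kernel size equals its stride, which is exactly the stem described above.

```python
import numpy as np

def patchify_stem(image, patch_size=16):
    """Split an (H, W, C) image into non-overlapping flattened patches.

    This is equivalent to a convolution with kernel = stride = patch_size,
    i.e. the ViT "patchify stem" (before the learned linear projection).
    """
    H, W, C = image.shape
    assert H % patch_size == 0 and W % patch_size == 0
    # Reshape into a grid of patches, then bring the patch dims together.
    patches = image.reshape(H // patch_size, patch_size,
                            W // patch_size, patch_size, C)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, patch_size * patch_size * C)

image = np.random.rand(224, 224, 3)
tokens = patchify_stem(image)
print(tokens.shape)  # (196, 768): a 14x14 grid of patches, each 16*16*3 values
```

Because each 16x16 patch is processed in a single step with no overlap, the stem has a very coarse, non-hierarchical view of the input, which is one intuition for why replacing it with a stack of small-kernel convolutions can improve optimizability.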
