Bridging the Gap Between Vision Transformers and Convolutional Neural Networks on Small Datasets

Neural Information Processing Systems 

ViTs to attend the spatial relevance. Second, on channel aspect, representation exhibits diversity on different channels. But the scarce data can not enable ViTs to learn strong enough representation for accurate recognition.