ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases