ConvNet vs Transformer, Supervised vs CLIP: Beyond ImageNet Accuracy