Does CLIP's Generalization Performance Mainly Stem from High Train-Test Similarity?