Modeling Caption Diversity in Contrastive Vision-Language Pretraining