Perceptual Grouping in Contrastive Vision-Language Models