Do Vision-Language Pretrained Models Learn Composable Primitive Concepts?

Open in new window