Adding simple structure at inference improves Vision-Language Compositionality

Open in new window