CREPE: Can Vision-Language Foundation Models Reason Compositionally?
Ma, Zixian, Hong, Jerry, Gul, Mustafa Omer, Gandhi, Mona, Gao, Irena, Krishna, Ranjay
arXiv.org Artificial Intelligence
A fundamental characteristic common to both human vision and natural language is their compositional nature. Yet, despite the performance gains contributed by large vision and language pretraining, we find that models spanning 7 architectures, trained with 4 algorithms on massive datasets, struggle at compositionality. To arrive at this conclusion, we introduce a new compositionality evaluation benchmark, CREPE, which measures two important aspects of compositionality identified by cognitive science literature: systematicity and productivity. To measure systematicity, CREPE includes a test dataset containing over $370K$ image-text pairs and three different seen-unseen splits. The three splits are designed to test models trained on three popular training datasets: CC-12M, YFCC-15M, and LAION-400M. We also generate $325K$, $316K$, and $309K$ hard negative captions for a subset of the pairs. To test productivity, CREPE contains $17K$ image-text pairs with nine different complexities plus $183K$ hard negative captions with atomic, swapping, and negation foils. The datasets are generated by repurposing the Visual Genome scene graphs and region descriptions and applying handcrafted templates and GPT-3. For systematicity, we find that model performance decreases consistently when novel compositions dominate the retrieval set, with Recall@1 dropping by up to $12\%$. For productivity, models' retrieval success decays as complexity increases, frequently nearing random chance at high complexity. These results hold regardless of model and training dataset size.
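The Recall@1 figures in the abstract come from retrieval against hard negative captions. A minimal sketch of how such a score could be computed is given below; it is an illustration under stated assumptions, not the authors' evaluation code. The function names `recall_at_1` and `chance_level`, the convention that the ground-truth caption sits at index 0 of each row, and the similarity values are all hypothetical.

```python
from typing import Sequence


def recall_at_1(score_rows: Sequence[Sequence[float]], truth_index: int = 0) -> float:
    """Fraction of queries whose ground-truth caption outscores every foil.

    Each row holds the image-text similarity scores for one image: the
    ground-truth caption at `truth_index` (an assumed layout) and hard
    negative captions everywhere else.
    """
    if not score_rows:
        raise ValueError("need at least one query")
    hits = 0
    for row in score_rows:
        truth = row[truth_index]
        # A hit requires the true caption to strictly beat every foil;
        # ties are counted as misses.
        if all(truth > s for i, s in enumerate(row) if i != truth_index):
            hits += 1
    return hits / len(score_rows)


def chance_level(num_candidates: int) -> float:
    """Recall@1 expected from picking one candidate uniformly at random."""
    return 1.0 / num_candidates


# Made-up similarity scores: 4 images, each with 1 true caption and 3 foils.
scores = [
    [0.31, 0.28, 0.22, 0.19],  # true caption ranked first
    [0.25, 0.27, 0.21, 0.18],  # a foil outranks the true caption
    [0.40, 0.12, 0.33, 0.39],  # true caption ranked first
    [0.22, 0.22, 0.10, 0.05],  # tie with a foil, counted as a miss
]

print(recall_at_1(scores))
print(chance_level(len(scores[0])))
```

Comparing the measured value against `chance_level` is one way to read the abstract's statement that retrieval success nears random chance at high complexity: with one true caption and three foils per image, chance is 0.25.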
May-16-2023
- Country:
  - North America
    - Dominican Republic (0.04)
    - United States
      - Pennsylvania (0.04)
      - Washington > King County
        - Seattle (0.04)
      - California > Santa Clara County
        - Palo Alto (0.04)
    - Canada > British Columbia
  - Europe
  - Asia > China
    - Hong Kong (0.04)
- Genre:
  - Research Report (0.63)
- Technology:
  - Information Technology > Artificial Intelligence
    - Vision (1.00)
    - Natural Language
      - Large Language Model (0.49)
      - Chatbot (0.48)
    - Machine Learning
      - Performance Analysis > Accuracy (0.46)
      - Neural Networks > Deep Learning (0.34)