CLoVe: Encoding Compositional Language in Contrastive Vision-Language Models
Santiago Castro, Amir Ziai, Avneesh Saluja, Zhuoning Yuan, Rada Mihalcea
arXiv.org Artificial Intelligence
Recent years have witnessed a significant increase in the performance of Vision and Language tasks. Foundational Vision-Language Models (VLMs), such as CLIP, have been leveraged in multiple settings and demonstrated remarkable performance across several tasks. Such models excel at object-centric recognition yet learn text representations that seem invariant to word order, failing to compose known concepts in novel ways. However, no evidence exists that any VLM, including large-scale single-stream models such as GPT-4V, identifies compositions successfully. In this paper, we introduce a framework to significantly improve the ability of existing models to encode compositional language, with over 10% absolute improvement on compositionality benchmarks, while maintaining or improving the performance on standard object-recognition and retrieval benchmarks. Our code and pre-trained models are publicly available at https://github.com/netflix/clove.
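The abstract's central observation, that contrastive VLMs learn text representations largely invariant to word order, can be probed directly by scoring one image against two captions that use the same words in different roles. The sketch below is illustrative only and is not taken from the paper: it uses the standard Hugging Face CLIP API with the public OpenAI checkpoint as a stand-in, and the image path is a placeholder. The abstract states that CLoVe's code and pre-trained models are available at https://github.com/netflix/clove; swapping in such a checkpoint, where compatible, would be the analogous test for the patched model.

```python
# Minimal word-order sensitivity probe for a CLIP-style contrastive VLM.
# Assumptions: the "openai/clip-vit-base-patch32" checkpoint as a stand-in
# model, and "example.jpg" as a hypothetical local image.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")
captions = [
    "a dog chasing a cat",  # intended composition
    "a cat chasing a dog",  # same words, swapped roles
]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the image-text similarity for each caption.
# A word-order-insensitive model scores both captions nearly identically;
# a compositional model separates the correct caption from the swapped one.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(captions, probs.squeeze().tolist())))
```

Benchmarks such as the compositionality suites mentioned in the abstract systematize this idea by pairing each image with a correct caption and a hard negative built from the same vocabulary.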
Feb-29-2024