CLoVe: Encoding Compositional Language in Contrastive Vision-Language Models
Santiago Castro, Amir Ziai, Avneesh Saluja, Zhuoning Yuan, Rada Mihalcea
arXiv.org Artificial Intelligence
Recent years have witnessed a significant increase in the performance of Vision and Language tasks. Foundational Vision-Language Models (VLMs), such as CLIP, have been leveraged in multiple settings and demonstrated remarkable performance across several tasks. Such models excel at object-centric recognition yet learn text representations that seem invariant to word order, failing to compose known concepts in novel ways. However, no evidence exists that any VLM, including large-scale single-stream models such as GPT-4V, identifies compositions successfully. In this paper, we introduce a framework to significantly improve the ability of existing models to encode compositional language, with over 10% absolute improvement on compositionality benchmarks, while maintaining or improving the performance on standard object-recognition and retrieval benchmarks. Our code and pre-trained models are publicly available at https://github.com/netflix/clove.
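The abstract's central observation, that contrastive VLMs learn text representations largely invariant to word order, can be probed directly by scoring one image against two captions that use the same words in different roles. The sketch below is illustrative only and is not taken from the paper: it uses the standard Hugging Face CLIP API with the public OpenAI checkpoint as a stand-in, and the image path is a placeholder. The abstract states that CLoVe's code and pre-trained models are available at https://github.com/netflix/clove; swapping in such a checkpoint, where compatible, would be the analogous test for the patched model.

```python
# Minimal word-order sensitivity probe for a CLIP-style contrastive VLM.
# Assumptions: the "openai/clip-vit-base-patch32" checkpoint as a stand-in
# model, and "example.jpg" as a hypothetical local image.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")
captions = [
    "a dog chasing a cat",  # intended composition
    "a cat chasing a dog",  # same words, swapped roles
]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the image-text similarity for each caption.
# A word-order-insensitive model scores both captions nearly identically;
# a compositional model separates the correct caption from the swapped one.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(captions, probs.squeeze().tolist())))
```

Benchmarks such as the compositionality suites mentioned in the abstract systematize this idea by pairing each image with a correct caption and a hard negative built from the same vocabulary.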
Feb-29-2024