Prompting Large Vision-Language Models for Compositional Reasoning
Timothy Ossowski, Ming Jiang, Junjie Hu
Vision-language models such as CLIP have shown impressive capabilities in encoding texts and images into aligned embeddings, enabling the retrieval of multimodal data in a shared embedding space. However, these embedding-based models still face challenges in effectively matching images and texts with similar visio-linguistic compositionality, as evidenced by their performance on the recent Winoground dataset. In this paper, we argue that this limitation stems from two factors: the use of single vector representations for complex multimodal data, and the absence of step-by-step reasoning in these embedding-based methods. To address these issues, we take an exploratory step with a novel generative method that prompts large vision-language models (e.g., GPT-4) to depict images and perform compositional reasoning. Our method outperforms other embedding-based methods on the Winoground dataset and achieves a further improvement of up to 10% accuracy when enhanced with the optimal description.
arXiv.org Artificial Intelligence
Jan-20-2024
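The abstract describes the approach only at a high level: generate a detailed description of the image with a large vision-language model, then prompt the model to reason step by step about which candidate caption matches. The sketch below is not the authors' released code; it assumes an OpenAI-style chat completion client and a GPT-4 class vision model, and the model name, prompt wording, and helper functions are illustrative assumptions.

```python
# Minimal sketch (assumed API, not the paper's implementation):
# 1) ask a vision-language model to "depict" an image in detail,
# 2) prompt it to reason step by step over candidate captions.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def encode_image(path: str) -> str:
    """Base64-encode an image file for inline submission."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


def describe_image(image_path: str) -> str:
    """Generate a detailed description of the image (the 'depict' step)."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder for a GPT-4 class vision model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe this image in detail, including the objects, "
                         "their attributes, and the relationships between them."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{encode_image(image_path)}"}},
            ],
        }],
    )
    return response.choices[0].message.content


def match_caption(description: str, captions: list[str]) -> int:
    """Prompt for step-by-step compositional reasoning over candidate captions."""
    numbered = "\n".join(f"{i}: {c}" for i, c in enumerate(captions))
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": (
                f"Image description:\n{description}\n\n"
                f"Candidate captions:\n{numbered}\n\n"
                "Reason step by step about which caption matches the described scene, "
                "paying close attention to word order and object-attribute bindings. "
                "End your response with 'Answer: <index>'."
            ),
        }],
    )
    text = response.choices[0].message.content
    return int(text.rsplit("Answer:", 1)[-1].strip().split()[0])
```

On a Winoground-style example, the two captions contain the same words in different orders, so the explicit description plus chain-of-thought matching stands in for the single-vector similarity score that an embedding model like CLIP would compute.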