Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic

Yoad Tewel, Yoav Shalev, Idan Schwartz, Lior Wolf

Nov-29-2021, arXiv.org Artificial Intelligence

Recent text-to-image matching models apply contrastive learning to large corpora of uncurated pairs of images and sentences. While such models can provide a powerful score for matching and subsequent zero-shot tasks, they are not capable of generating a caption given an image. In this work, we repurpose such models to generate descriptive text for an image at inference time, without any further training or tuning step. This is done by combining the visual-semantic model with a large language model, benefiting from the knowledge in both web-scale models. The resulting captions are much less restrictive than those obtained by supervised captioning methods. Moreover, as a zero-shot learning method, it is extremely flexible, and we demonstrate its ability to perform image arithmetic in which the inputs can be either images or text and the output is a sentence. This enables novel high-level vision capabilities such as comparing two images or solving visual analogy tests.
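The image arithmetic mentioned in the abstract can be illustrated with a toy sketch. It assumes that images and text are already mapped into one shared embedding space, and it replaces the paper's language-model generation step with a much simpler stand-in: scoring a fixed list of candidate sentences by cosine similarity. The three-dimensional vectors, the candidate captions, and the function names (`normalize`, `cosine`, `arithmetic`, `best_caption`) are all hypothetical and are not taken from the paper or from any library.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return sum(a * b for a, b in zip(normalize(u), normalize(v)))


def arithmetic(plus, minus):
    """Add the unit-normalized `plus` embeddings and subtract the `minus` ones."""
    out = [0.0] * len(plus[0])
    for sign, group in ((1.0, plus), (-1.0, minus)):
        for v in group:
            for i, x in enumerate(normalize(v)):
                out[i] += sign * x
    return normalize(out)


def best_caption(query, candidates):
    """Return the candidate sentence whose embedding is closest to `query`."""
    return max(candidates, key=lambda text: cosine(query, candidates[text]))


# Made-up embeddings with hypothetical axes (royalty, male, female).
# In practice these would come from a visual-semantic model.
image_king = [1.0, 1.0, 0.0]   # stands in for an image embedding
text_man = [0.0, 1.0, 0.0]     # stands in for a text embedding
text_woman = [0.0, 0.0, 1.0]   # stands in for a text embedding

# Hypothetical candidate sentences with made-up embeddings.
candidates = {
    "a king": [1.0, 1.0, 0.0],
    "a queen": [1.0, 0.0, 1.0],
    "a man": [0.0, 1.0, 0.0],
}

# image(king) - text(man) + text(woman)
query = arithmetic(plus=[image_king, text_woman], minus=[text_man])
result = best_caption(query, candidates)
print(result)
```

The sketch only shows why mixing image and text inputs is possible once both live in the same space. The method described in the abstract produces a free-form sentence by combining the visual-semantic model with a large language model, rather than choosing from a fixed candidate list.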



Country:
- Oceania > Australia
  - Australian Capital Territory > Canberra (0.04)
- North America
  - United States > Massachusetts
    - Middlesex County > Cambridge (0.04)
  - Canada > Ontario
    - Toronto (0.04)
- Europe
  - Germany (0.28)
  - Italy (0.04)
  - France (0.04)
  - United Kingdom (0.04)
- Asia
  - China (0.28)
  - Middle East > Israel
    - Tel Aviv District > Tel Aviv (0.04)
  - Japan > Honshū
    - Kantō > Tokyo Metropolis Prefecture > Tokyo (0.04)
- Africa > Middle East
  - Egypt (0.04)

Genre:
- Research Report (1.00)

Industry:
- Government > Regional Government > Europe Government (0.46)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)