What You See is What You Read? Improving Text-Image Alignment Evaluation

Yarom, Michal, Bitton, Yonatan, Changpinyo, Soravit, Aharoni, Roee, Herzig, Jonathan, Lang, Oran, Ofek, Eran, Szpektor, Idan

Dec-26-2023–arXiv.org Artificial Intelligence

Automatically determining whether a text and a corresponding image are semantically aligned is a significant challenge for vision-language models, with applications in generative text-to-image and image-to-text tasks. In this work, we study methods for automatic text-image alignment evaluation. We first introduce SeeTRUE: a comprehensive evaluation set, spanning multiple datasets from both text-to-image and image-to-text generation tasks, with human judgements for whether a given text-image pair is semantically aligned. We then describe two automatic methods to determine alignment: the first involving a pipeline based on question generation and visual question answering models, and the second employing an end-to-end classification approach by finetuning multimodal pretrained models. Both methods surpass prior approaches in various text-image alignment tasks, with significant improvements in challenging cases that involve complex composition or unnatural images. Finally, we demonstrate how our approaches can localize specific misalignments between an image and a given text, and how they can be used to automatically re-rank candidates in text-to-image generation.
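The first method, along with the localization and re-ranking uses mentioned at the end of the abstract, can be pictured with a minimal sketch. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not say how per-question answers are aggregated, so a plain mean is assumed here. Every name (`alignment_score`, `least_supported`, `rerank`, and the toy stand-ins for the question-generation and VQA models) is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Hypothetical interfaces (assumptions, not taken from the paper):
# a question generator maps a text to (question, expected_answer) pairs,
# and a VQA scorer returns a probability in [0, 1] that the image
# supports the expected answer to the question.
QuestionGenerator = Callable[[str], List[Tuple[str, str]]]
VqaScorer = Callable[[str, str, str], float]


@dataclass
class AlignmentResult:
    score: float                                # aggregated alignment score
    per_question: List[Tuple[str, str, float]]  # (question, answer, probability)


def alignment_score(image: str, text: str,
                    generate_qa: QuestionGenerator,
                    vqa_score: VqaScorer) -> AlignmentResult:
    """Score one text-image pair by averaging per-question VQA support."""
    qa_pairs = generate_qa(text)
    if not qa_pairs:
        raise ValueError("question generator returned no question-answer pairs")
    per_question = [(q, a, vqa_score(image, q, a)) for q, a in qa_pairs]
    # Assumption: a simple mean; other aggregations are possible.
    mean = sum(p for _, _, p in per_question) / len(per_question)
    return AlignmentResult(score=mean, per_question=per_question)


def least_supported(result: AlignmentResult) -> Tuple[str, str, float]:
    """Localize the likeliest misalignment: the question with the lowest support."""
    return min(result.per_question, key=lambda item: item[2])


def rerank(images: Sequence[str], text: str,
           generate_qa: QuestionGenerator,
           vqa_score: VqaScorer) -> List[Tuple[str, float]]:
    """Order candidate images for a prompt from most to least aligned."""
    scored = [(img, alignment_score(img, text, generate_qa, vqa_score).score)
              for img in images]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy stand-ins so the sketch runs without any model; the values are made up.
_TOY_SUPPORT = {
    "img_a": {"black": 0.9, "dog": 0.8},
    "img_b": {"black": 0.1, "dog": 0.9},
}


def toy_generate_qa(text: str) -> List[Tuple[str, str]]:
    # Ignores its input and returns fixed pairs for the demo prompt.
    return [("What color is the dog?", "black"),
            ("What animal is shown?", "dog")]


def toy_vqa_score(image: str, question: str, answer: str) -> float:
    # Looks up a made-up probability; unknown images or answers score 0.
    return _TOY_SUPPORT.get(image, {}).get(answer, 0.0)


ranking = rerank(["img_b", "img_a"], "a black dog", toy_generate_qa, toy_vqa_score)
weakest = least_supported(
    alignment_score("img_b", "a black dog", toy_generate_qa, toy_vqa_score))
```

In this toy setup, `ranking` places `img_a` ahead of `img_b`, and `weakest` points to the color question for `img_b`, which is the kind of localized misalignment signal the abstract describes. In practice the two toy callables would be replaced by real question-generation and VQA models.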

Country:
- Oceania
  - New Zealand (0.04)
  - Australia > Victoria
    - Melbourne (0.04)
- North America
  - Dominican Republic (0.04)
  - United States
    - Washington > King County
      - Seattle (0.04)
    - Texas > Travis County
      - Austin (0.04)
    - Minnesota > Hennepin County
      - Minneapolis (0.14)
- Europe
  - Switzerland > Zürich
    - Zürich (0.14)
  - Portugal > Lisbon
    - Lisbon (0.04)
  - Belgium > Brussels-Capital Region
    - Brussels (0.04)
- Asia > Middle East
  - Israel > Jerusalem District > Jerusalem (0.04)

Genre:
- Research Report (0.64)

Technology:
- Information Technology > Artificial Intelligence
  - Vision (1.00)
  - Natural Language (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (0.46)