Beyond Image-Text Matching: Verb Understanding in Multimodal Transformers Using Guided Masking
Beňová, Ivana, Košecká, Jana, Gregor, Michal, Tamajka, Martin, Veselý, Marcel, Šimko, Marián
–arXiv.org Artificial Intelligence
The dominant probing approaches rely on the zero-shot performance of image-text matching tasks to gain a finer-grained understanding of the representations learned by recent multimodal image-language transformer models. The evaluation is carried out on carefully curated datasets focusing on counting, relations, attributes, and others. This work introduces an alternative probing strategy called guided masking. The proposed approach ablates different modalities using masking and assesses the model's ability to predict the masked word with high accuracy. We focus on studying multimodal models that consider regions of interest (ROI) features obtained by object detectors as input tokens. We probe the understanding of verbs using guided masking on ViLBERT, LXMERT, UNITER, and VisualBERT and show that these models can predict …

Figure 1: Image from the SVO-Probes dataset (Hendricks and Nematzadeh, 2021). It consists of image-caption pairs, where the sentence either correctly describes the image (positive example) or one aspect of the sentence (subject, verb, or object) does not match the image (negative example). These pairs are used to probe models through zero-shot image-text matching. Example of a positive caption: A person walking on a trail.
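As a rough illustration of the guided-masking protocol the abstract describes (not the authors' code), the evaluation loop can be sketched as follows. The helper `predict_masked_word` is a hypothetical stand-in for any multimodal transformer's masked-language-modeling head (e.g. ViLBERT or LXMERT conditioned on ROI features); it is stubbed here so the sketch is self-contained:

```python
def guided_masking_accuracy(examples, predict_masked_word):
    """Mask the verb in each caption and check whether the model
    recovers it. `examples` is a list of (caption_tokens, verb_index)
    pairs; `predict_masked_word` maps a masked token list to a word."""
    correct = 0
    for tokens, verb_idx in examples:
        masked = list(tokens)
        target = masked[verb_idx]
        masked[verb_idx] = "[MASK]"  # ablate only the verb token
        if predict_masked_word(masked) == target:
            correct += 1
    return correct / len(examples)

# Stub predictor for demonstration: always guesses "walking".
# A real run would instead query a multimodal model with the image's
# region features attached to the masked caption.
stub = lambda masked: "walking"
examples = [
    (["A", "person", "walking", "on", "a", "trail"], 2),
    (["A", "dog", "running", "in", "a", "field"], 2),
]
print(guided_masking_accuracy(examples, stub))  # 0.5
```

Unlike zero-shot image-text matching, which only yields a binary match/no-match signal per caption pair, this setup scores the model directly on recovering the ablated word.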
Jan-29-2024