Beyond Image-Text Matching: Verb Understanding in Multimodal Transformers Using Guided Masking
Beňová, Ivana, Košecká, Jana, Gregor, Michal, Tamajka, Martin, Veselý, Marcel, Šimko, Marián
–arXiv.org Artificial Intelligence
The dominant probing approaches rely on the zero-shot performance of image-text matching tasks to gain a finer-grained understanding of the representations learned by recent multimodal image-language transformer models. The evaluation is carried out on carefully curated datasets focusing on counting, relations, attributes, and others. This work introduces an alternative probing strategy called guided masking. The proposed approach ablates different modalities using masking and assesses the model's ability to predict the masked word with high accuracy. We focus on studying multimodal models that consider regions of interest (ROI) features obtained by object detectors as input tokens. We probe the understanding of verbs using guided masking on ViLBERT, LXMERT, UNITER, and VisualBERT and show that these models can predict …

Figure 1: Image from the SVO-Probes dataset (Hendricks and Nematzadeh, 2021). It consists of image-caption pairs, where the sentence either correctly describes the image (positive example) or one aspect of the sentence (subject, verb, or object) does not match the image (negative example). These pairs are used to probe models through zero-shot image-text matching. Example of a positive caption: A person walking on a trail.
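As a rough illustration of the guided-masking protocol the abstract describes (not the authors' code), the evaluation loop can be sketched as follows. The helper `predict_masked_word` is a hypothetical stand-in for any multimodal transformer's masked-language-modeling head (e.g. ViLBERT or LXMERT conditioned on ROI features); it is stubbed here so the sketch is self-contained:

```python
def guided_masking_accuracy(examples, predict_masked_word):
    """Mask the verb in each caption and check whether the model
    recovers it. `examples` is a list of (caption_tokens, verb_index)
    pairs; `predict_masked_word` maps a masked token list to a word."""
    correct = 0
    for tokens, verb_idx in examples:
        masked = list(tokens)
        target = masked[verb_idx]
        masked[verb_idx] = "[MASK]"  # ablate only the verb token
        if predict_masked_word(masked) == target:
            correct += 1
    return correct / len(examples)

# Stub predictor for demonstration: always guesses "walking".
# A real run would instead query a multimodal model with the image's
# region features attached to the masked caption.
stub = lambda masked: "walking"
examples = [
    (["A", "person", "walking", "on", "a", "trail"], 2),
    (["A", "dog", "running", "in", "a", "field"], 2),
]
print(guided_masking_accuracy(examples, stub))  # 0.5
```

Unlike zero-shot image-text matching, which only yields a binary match/no-match signal per caption pair, this setup scores the model directly on recovering the ablated word.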
Jan-29-2024