Do Vision-and-Language Transformers Learn Grounded Predicate-Noun Dependencies?
