Measuring Progress in Fine-grained Vision-and-Language Understanding
Emanuele Bugliarello, Laurent Sartran, Aishwarya Agrawal, Lisa Anne Hendricks, Aida Nematzadeh
–arXiv.org Artificial Intelligence
Fine-grained multimodal skills (e.g., understanding relationships and recognising verbs) require identifying and relating various entities across both image and text modalities. Vision-and-language models (VLMs) need such skills to robustly perform well on real-world vision-and-language (V&L) applications; e.g., a coarse-grained model tested on image retrieval to "find an image where something is on a sofa" might incorrectly return an image of a cat sitting below the sofa. As another example, in captioning, a model might incorrectly describe an image where "someone is selling a sweater" as "someone is buying a sweater," if it does not have a precise understanding of the two verbs.

First we consider: Which models perform well on fine-grained tasks? To answer this, we evaluate models from four different model families trained with different amounts of pretraining data, as well as recent architectures that leverage frozen large language models (LLMs). We observe that modelling innovations have more impact than simply scaling image captions from the Web. Furthermore, explicitly modelling localisation can improve performance, but it is crucial how it is done, and simply using localisation data is not enough. Our observations motivate our next question: How do data and losses impact fine-grained understanding? We focus our study on the best performing …
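The retrieval failure described above can be made concrete with a caption-contrast probe, in the spirit of the fine-grained benchmarks this line of work evaluates on: a VLM scores a single image against two captions that differ only in a relation or verb, and a fine-grained model should prefer the correct one. The sketch below is illustrative only, not the paper's evaluation code; the CLIP checkpoint and the image path are assumptions chosen for the example.

```python
# Minimal caption-contrast probe (illustrative sketch, not from the paper).
# Assumes a generic CLIP checkpoint and a hypothetical local test image.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("cat_below_sofa.jpg")  # hypothetical image of a cat under a sofa
captions = ["a cat on a sofa", "a cat below a sofa"]  # differ only in the relation

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the image-text similarity for each caption;
# a model with fine-grained relational understanding should rank
# "a cat below a sofa" above "a cat on a sofa" for this image.
probs = outputs.logits_per_image.softmax(dim=-1)
for caption, p in zip(captions, probs[0].tolist()):
    print(f"{caption}: {p:.3f}")
```

The same template extends to verb contrasts (e.g., "selling" vs. "buying a sweater"): only the caption pair changes, which is why such probes isolate fine-grained skills from coarse image-text matching.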
May-12-2023