Understanding Cross-modal Interactions in V&L Models that Generate Scene Descriptions
Cafagna, Michele, van Deemter, Kees, Gatt, Albert
arXiv.org Artificial Intelligence
Image captioning models tend to describe images in an object-centric way, emphasising visible objects. But image descriptions can also abstract away from objects and describe the type of scene depicted. In this paper, we explore the potential of a state-of-the-art Vision and Language model, VinVL, to caption images at the scene level using (1) a novel dataset that pairs images with both object-centric and scene descriptions. Through (2) an in-depth analysis of the effect of fine-tuning, we show (3) that a small amount of curated data suffices to generate scene descriptions without losing the capability to identify object-level concepts in the scene; the model acquires a more holistic view of the image compared to when object-centric descriptions are generated. We discuss the parallels between these results and insights from computational and cognitive science research on scene perception.
Nov-10-2022
- Country:
- Africa > Ethiopia
- Addis Ababa > Addis Ababa (0.04)
- Asia > China
- Hong Kong (0.04)
- Europe
- Belgium > Brussels-Capital Region
- Brussels (0.04)
- Ireland > Leinster
- County Dublin > Dublin (0.04)
- Italy > Tuscany
- Florence (0.04)
- Middle East > Malta (0.04)
- Spain > Andalusia
- Granada Province > Granada (0.04)
- North America
- Canada > British Columbia
- Vancouver (0.04)
- Dominican Republic (0.04)
- United States
- Colorado > Denver County
- Denver (0.04)
- Minnesota > Hennepin County
- Minneapolis (0.14)
- Oceania > Australia
- New South Wales > Sydney (0.04)
- Victoria > Melbourne (0.04)
- Genre:
- Research Report (0.82)
- Technology:
- Information Technology > Artificial Intelligence
- Machine Learning > Neural Networks (0.93)
- Natural Language (1.00)
- Representation & Reasoning (1.00)
- Vision (1.00)