Weakly-Supervised Learning of Visual Relations in Multimodal Pretraining

Bugliarello, Emanuele, Nematzadeh, Aida, Hendricks, Lisa Anne

Oct-19-2023–arXiv.org Artificial Intelligence

Recent work in vision-and-language pretraining has investigated supervised signals from object detection data to learn better, fine-grained multimodal representations. In this work, we take a step further and explore how we can tap into supervision from small-scale visual relation data. In particular, we propose two pretraining approaches to contextualise visual entities in a multimodal setup. With verbalised scene graphs, we transform visual relation triplets into structured captions, and treat them as additional image descriptions. With masked relation prediction, we further encourage relating entities from image regions with visually masked contexts. When applied to strong baselines pretrained on large amounts of Web data, zero-shot evaluations on both coarse-grained and fine-grained tasks show the efficacy of our methods in learning multimodal representations from weakly-supervised relations data.

computational linguistic, proceedings, relation, (13 more...)

arXiv.org Artificial Intelligence

Oct-19-2023

arXiv.org PDF

Add feedback

Country:
- South America > Chile
  - Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- Oceania > Australia
  - Victoria > Melbourne (0.04)
- North America
  - Dominican Republic (0.04)
  - United States
    - Washington > King County
      - Seattle (0.04)
    - New York > New York County
      - New York City (0.04)
    - Minnesota > Hennepin County
      - Minneapolis (0.14)
  - Canada
    - Ontario > Toronto (0.04)
    - British Columbia > Metro Vancouver Regional District
      - Vancouver (0.04)
- Europe
  - Spain (0.04)
  - Ireland (0.04)
  - Portugal > Lisbon
    - Lisbon (0.04)
  - Italy > Tuscany
    - Florence (0.04)
  - Denmark > Capital Region
    - Copenhagen (0.04)
- Asia
  - China > Hong Kong (0.04)
  - Middle East > UAE
    - Abu Dhabi Emirate > Abu Dhabi (0.04)

Genre:
- Research Report > New Finding (0.87)

Technology:
- Information Technology > Artificial Intelligence
  - Vision (1.00)
  - Representation & Reasoning (1.00)
  - Natural Language > Large Language Model (0.35)
  - Machine Learning
    - Neural Networks (0.46)
    - Inductive Learning (0.40)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found