Learning from Models and Data for Visual Grounding
He, Ruozhen, Cascante-Bonilla, Paola, Yang, Ziyan, Berg, Alexander C., Ordonez, Vicente
arXiv.org Artificial Intelligence
We introduce SynGround, a novel framework that combines data-driven learning with knowledge transfer from various large-scale pretrained models to enhance the visual grounding capabilities of a pretrained vision-and-language model. The knowledge transfer begins with the generation of image descriptions through an image description generator. These descriptions serve dual purposes: they act as prompts for synthesizing images through a text-to-image generator, and as queries for synthesizing text, from which phrases are extracted using a large language model. Finally, we leverage an open-vocabulary object detector to generate synthetic bounding boxes for the synthetic images and texts. We finetune a pretrained vision-and-language model on this dataset by optimizing a mask-attention consistency objective that aligns region annotations with gradient-based model explanations. The resulting model improves the grounding capabilities of an off-the-shelf vision-and-language model. In particular, SynGround improves the pointing game accuracy of ALBEF on the Flickr30k dataset from 79.38% to 87.26%, on RefCOCO+ Test A from 69.35% to 79.06%, and on RefCOCO+ Test B from 53.77% to 63.67%.
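The abstract describes a four-stage synthetic-data pipeline: generate image descriptions, synthesize images from them, extract groundable phrases with a large language model, and annotate the synthetic images with an open-vocabulary detector. A minimal sketch of that flow is below; every function body is a placeholder stub standing in for a large pretrained model, and all names, signatures, and return values are illustrative assumptions, not the authors' actual API.

```python
# Hypothetical sketch of the SynGround synthetic-data pipeline from the
# abstract. Each stub stands in for a large pretrained model (captioner,
# text-to-image generator, LLM, open-vocabulary detector); the real system
# plugs in actual models at each stage.

def generate_description(seed: int) -> str:
    # Stand-in for the image description generator.
    return f"a dog playing with a red ball in a park (seed {seed})"

def synthesize_image(description: str) -> str:
    # Stand-in for a text-to-image generator prompted with the description.
    return f"<image synthesized from: {description}>"

def extract_phrases(description: str) -> list[str]:
    # Stand-in for an LLM that extracts groundable phrases from the text.
    return ["a dog", "a red ball"]

def detect_boxes(image: str, phrases: list[str]) -> dict[str, tuple]:
    # Stand-in for an open-vocabulary detector producing synthetic
    # bounding boxes (x, y, w, h) for each phrase.
    return {p: (0.1 * i, 0.1 * i, 0.5, 0.5) for i, p in enumerate(phrases)}

def build_synthetic_sample(seed: int) -> dict:
    """Assemble one image-text-box triplet of the kind used to finetune
    the vision-and-language model with the mask-attention consistency
    objective (the objective itself is not sketched here)."""
    description = generate_description(seed)
    image = synthesize_image(description)
    phrases = extract_phrases(description)
    boxes = detect_boxes(image, phrases)
    return {"image": image, "phrases": phrases, "boxes": boxes}

sample = build_synthetic_sample(0)
print(sorted(sample))        # ['boxes', 'image', 'phrases']
print(len(sample["boxes"]))  # 2
```

The key design point the abstract emphasizes is that no real grounding annotations are needed: every component of the training triplet (image, phrase, box) is synthesized by a pretrained model.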
Mar-20-2024