Understanding the Effect of using Semantically Meaningful Tokens for Visual Representation Learning

Neha Kalibhat, Priyatham Kattakinda, Arman Zarei, Nikita Seleznev, Samuel Sharpe, Senthil Kumar, Soheil Feizi

arXiv.org Artificial Intelligence 

Vision transformers have established a precedent of patchifying images into uniformly sized chunks before processing. We hypothesize that this design choice may limit models in learning comprehensive and compositional representations from visual data. This paper explores the notion of providing semantically meaningful visual tokens to transformer encoders within a vision-language pre-training framework. Leveraging off-the-shelf segmentation and scene-graph models, we extract representations of instance segmentation masks (referred to as tangible tokens) and relationships and actions (referred to as intangible tokens). Subsequently, we pre-train a vision-side transformer by incorporating these newly extracted tokens and aligning the resultant embeddings with caption embeddings from a text-side encoder. To capture the structural and semantic relationships among visual tokens, we introduce additive attention weights, which are used to compute self-attention scores. Our experiments on COCO demonstrate notable improvements over ViTs in learned representation quality on text-to-image (+47%) and image-to-text (+44%) retrieval tasks. Furthermore, we showcase advantages on compositionality benchmarks such as ARO (+18%) and Winoground (+10%).
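To make the attention mechanism described above concrete, below is a minimal sketch (not the authors' code) of self-attention over semantic visual tokens with an additive bias term, assuming the "additive attention weights" take the form of a pairwise bias added to the scaled dot-product scores before the softmax; the module name, tensor shapes, and the way the bias is supplied are illustrative assumptions.

```python
# Minimal sketch: self-attention over tangible/intangible token embeddings with an
# additive pairwise bias on the attention scores (hypothetical interpretation of the
# paper's "additive attention weights"; shapes and names are assumptions).
import torch
import torch.nn as nn

class BiasedSelfAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, tokens: torch.Tensor, attn_bias: torch.Tensor) -> torch.Tensor:
        # tokens:    (batch, num_tokens, dim) -- embeddings of tangible + intangible tokens
        # attn_bias: (batch, num_heads, num_tokens, num_tokens) -- additive weights encoding
        #            structural/semantic relations among tokens (assumed shape)
        B, N, D = tokens.shape
        qkv = self.qkv(tokens).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)            # each: (B, heads, N, head_dim)
        scores = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        scores = scores + attn_bias                      # additive bias on attention scores
        attn = scores.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, D)
        return self.proj(out)
```

The resulting token embeddings would then be pooled and aligned with caption embeddings from the text encoder, e.g. via a CLIP-style contrastive objective, as described in the abstract.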
