Probing Inter-modality: Visual Parsing with Self-Attention for Vision-and-Language Pre-training

Dec-23-2025, 21:33:12 GMT–Neural Information Processing Systems

Vision-Language Pre-training (VLP) aims to learn multi-modal representations from image-text pairs, which then serve downstream vision-language tasks through fine-tuning. The dominant VLP models adopt a CNN-Transformer architecture, which embeds images with a CNN and then aligns images and text with a Transformer. Visual relationships among visual contents play an important role in image understanding and are the basis for inter-modal alignment learning. However, CNNs are limited in visual relation learning because their local receptive fields are weak at modeling long-range dependencies. Thus the two objectives of learning visual relations and inter-modal alignment are encapsulated in the same Transformer network. Such a design might restrict inter-modal alignment learning in the Transformer by ignoring the specialized characteristics of each objective.

inter-modality, self-attention, visual parsing, (11 more...)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning (0.98)
  - Vision > Image Understanding (0.39)