SLIP: Structural-aware Language-Image Pretraining for Vision-Language Alignment

Nov-6-2025–arXiv.org Artificial Intelligence

Vision-Language Pretraining (VLP) has achieved remarkable success across various downstream tasks, but these gains are largely driven by scaling up training data. Yet existing methods treat image-text pairs as isolated training examples; this neglects the rich relational structure naturally present in many domains, such as e-commerce product co-purchase graphs and social recommendation networks. Inspired by neuroscientific evidence that humans encode knowledge as relational cognitive maps, we introduce Structure-aware Language-Image Pretraining (SLIP). SLIP integrates a structural contrastive loss to align modalities while also modeling relationships between neighboring entities in a structured graph. To support this paradigm, we construct a large-scale Amazon Product Co-purchase Multi-modal Graph Dataset, enabling structured cross-modality supervision at scale. Experimental results show that SLIP consistently outperforms CLIP on cross-modal retrieval and classification tasks in both zero-shot and few-shot settings, demonstrating the value of relational supervision for cross-modal alignment.

Vision-language alignment has emerged as a key challenge in multimodal representation learning. Recent pretraining approaches have achieved remarkable success by learning from web-scale data, driving progress in multimodal tasks such as image-text retrieval, visual question answering (VQA), and image captioning (Gan et al., 2022). The ground-breaking work CLIP (Radford et al., 2021) showed that a simple contrastive objective can yield state-of-the-art representations when scaled to millions of noisy image-text pairs, and such large-scale training has since become the paradigm for vision-language foundation models. However, these web-scale corpora are notoriously noisy: captions can be generic, off-topic, or mismatched to the image.
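As a rough illustration of the contrastive objective discussed above, the following sketch computes a symmetric InfoNCE loss over a batch of paired image and text embeddings, which is the general form of the CLIP-style objective. The optional `adj` argument is a hypothetical addition: it treats graph neighbors as extra positives, which is one plausible way to inject relational structure. It is an assumption made for illustration and is not taken from the SLIP paper, whose exact loss is not given in this excerpt. The function names, the temperature value, and the toy embeddings are likewise illustrative.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of one list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def contrastive_loss(img, txt, adj=None, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    img, txt: lists of equal-length vectors; img[i] and txt[i] are a pair.
    adj: hypothetical 0/1 adjacency matrix (assumption, not from the paper).
         If given, graph neighbors are counted as additional positives;
         if None, only the matched pair (i, i) is positive (CLIP-style).
    """
    n = len(img)
    img = [normalize(v) for v in img]
    txt = [normalize(v) for v in txt]

    # logits[i][j]: temperature-scaled cosine similarity of image i, text j
    logits = [[sum(a * b for a, b in zip(img[i], txt[j])) / temperature
               for j in range(n)] for i in range(n)]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image view

    def is_positive(i, j):
        if i == j:
            return True
        # Treat the graph as undirected: an edge in either direction counts.
        return adj is not None and (adj[i][j] == 1 or adj[j][i] == 1)

    def one_direction(lg):
        total = 0.0
        for i in range(n):
            log_probs = log_softmax(lg[i])
            positives = [j for j in range(n) if is_positive(i, j)]
            # Average negative log-likelihood over all positives of item i.
            total -= sum(log_probs[j] for j in positives) / len(positives)
        return total / n

    return 0.5 * (one_direction(logits) + one_direction(logits_t))


if __name__ == "__main__":
    images = [[1.0, 0.0], [0.0, 1.0]]
    texts = [[1.0, 0.0], [0.0, 1.0]]
    print("pairwise only:", contrastive_loss(images, texts))
    print("with neighbors:", contrastive_loss(images, texts, adj=[[0, 1], [1, 0]]))
```

With `adj=None` the function reduces to the standard pairwise objective. Averaging over multiple positives is only one possible design; a weighted or separately scaled structural term would be an equally reasonable variant.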
