SLIP: Structure-aware Language-Image Pretraining for Vision-Language Alignment
arXiv.org Artificial Intelligence
Vision-Language Pretraining (VLP) has achieved remarkable success across various downstream tasks, but these gains are largely driven by scaling up training data. Yet existing methods treat image-text pairs as isolated training examples, neglecting the rich relational structure naturally present in many domains, such as e-commerce product co-purchase graphs and social recommendation networks. Inspired by neuroscientific evidence that humans encode knowledge as relational cognitive maps, we introduce Structure-aware Language-Image Pretraining (SLIP). SLIP integrates a structural contrastive loss that aligns modalities while also modeling relationships between neighboring entities in a structured graph. To support this paradigm, we construct a large-scale Amazon Product Co-purchase Multi-modal Graph Dataset, enabling structured cross-modality supervision at scale. Experimental results show that SLIP consistently outperforms CLIP on cross-modal retrieval and classification tasks in both zero-shot and few-shot settings, demonstrating the value of relational supervision for cross-modal alignment.

Vision-language alignment has emerged as a key challenge in multimodal representation learning. Recent pretraining approaches have achieved remarkable success by learning from web-scale data, driving progress in multimodal tasks such as image-text retrieval, visual question answering (VQA), and image captioning (Gan et al., 2022). The groundbreaking CLIP model (Radford et al., 2021) showed that a simple contrastive objective can yield state-of-the-art representations when scaled to millions of noisy image-text pairs, and such large-scale training has since become the dominant paradigm for vision-language foundation models. However, these web-scale corpora are notoriously noisy: captions can be generic, off-topic, or mismatched to the image.
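To make the contrast between pairwise and relational supervision concrete, the following is a minimal sketch in plain Python. It shows a CLIP-style symmetric contrastive (InfoNCE) loss in which each image's set of positive texts can be widened from its own caption to the captions of its graph neighbors. This excerpt does not specify SLIP's structural contrastive loss, so the neighbor-as-positive scheme, the function names (`contrastive_loss`, `clip_positives`, `graph_positives`), and the temperature value are all illustrative assumptions rather than the authors' formulation.

```python
import math


def _normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def _log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def clip_positives(n):
    """Pairwise supervision: item i is positive only with itself."""
    return [{i} for i in range(n)]


def graph_positives(n, edges):
    """Hypothetical relational supervision: graph neighbors (for example,
    co-purchased products) are added as extra positives. Edges are undirected."""
    pos = [{i} for i in range(n)]
    for u, v in edges:
        pos[u].add(v)
        pos[v].add(u)
    return pos


def contrastive_loss(img, txt, positives, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of image and text embeddings.

    positives[i] is the set of text indices treated as positives for image i.
    With clip_positives this is the standard CLIP-style objective; with
    graph_positives it is an assumed structural variant, not the paper's loss.
    """
    img = [_normalize(v) for v in img]
    txt = [_normalize(v) for v in txt]
    n = len(img)

    # Cosine-similarity logits scaled by the temperature.
    logits = [
        [sum(a * b for a, b in zip(img[i], txt[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def one_direction(lg, pos):
        total = 0.0
        for i in range(n):
            log_probs = _log_softmax(lg[i])
            # Average negative log-likelihood over this row's positives.
            total += -sum(log_probs[j] for j in pos[i]) / len(pos[i])
        return total / n

    # Text-to-image direction uses the transposed logits and positive relation.
    logits_t = [list(col) for col in zip(*logits)]
    positives_t = [{i for i in range(n) if j in positives[i]} for j in range(n)]

    return 0.5 * (one_direction(logits, positives) + one_direction(logits_t, positives_t))


if __name__ == "__main__":
    # Toy batch: items 0 and 1 are similar and linked by a graph edge.
    img = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
    txt = [[1.0, 0.1], [0.8, 0.2], [0.1, 1.0]]
    print("pairwise loss:", contrastive_loss(img, txt, clip_positives(3)))
    print("graph loss:   ", contrastive_loss(img, txt, graph_positives(3, [(0, 1)])))
```

In this sketch the only difference between the two settings is the positive set: under pairwise supervision, related items in the same batch are pushed apart as negatives, whereas the graph variant treats them as additional positives.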
Nov-6-2025