DisCoCLIP: A Distributional Compositional Tensor Network Encoder for Vision-Language Understanding

Lo, Kin Ian, Hawashin, Hala, Abbaszadeh, Mina, Limback-Stokin, Tilen, Wazni, Hadi, Sadrzadeh, Mehrnoosh

Sep-26-2025, arXiv.org Artificial Intelligence

Recent vision-language models excel at large-scale image-text alignment but often neglect the compositional structure of language, leading to failures on tasks that hinge on word order and predicate-argument structure. We introduce DisCoCLIP, a multimodal encoder that combines a frozen CLIP vision transformer with a novel tensor network text encoder that explicitly encodes syntactic structure. Sentences are parsed with a Combinatory Categorial Grammar parser to yield distributional word tensors whose contractions mirror the sentence's grammatical derivation. To keep the model efficient, high-order tensors are factorized with tensor decompositions, reducing the parameter count from tens of millions to under one million. Trained end-to-end with a self-supervised contrastive loss, DisCoCLIP markedly improves sensitivity to verb semantics and word order: it raises CLIP's SVO-Probes verb accuracy from 77.6% to 82.4%, boosts ARO attribution and relation scores by over 9% and 4%, respectively, and achieves 93.7% on a newly introduced SVO-Swap benchmark. These results demonstrate that embedding explicit linguistic structure via tensor networks yields interpretable, parameter-efficient representations that substantially improve compositional reasoning in vision-language tasks.
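The following is a minimal NumPy sketch of the idea described in the abstract, not the authors' implementation. It assumes a toy subject-verb-object sentence in which nouns are vectors and a transitive verb is an order-3 tensor contracted with its subject and object, and it uses a CP (rank-r) factorization as one possible tensor decomposition; the abstract does not say which decomposition the paper uses. The dimension `d`, the rank `r`, the word names, and all function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 16, 4  # hypothetical embedding dimension and CP rank (assumptions)

# Hypothetical noun vectors.
dog = rng.normal(size=d)
cat = rng.normal(size=d)

# Hypothetical CP factors for a transitive verb such as "chases".
# Each factor matrix corresponds to one leg of the verb tensor:
# A -> subject leg, B -> sentence (output) leg, C -> object leg.
A = rng.normal(size=(d, r))
B = rng.normal(size=(d, r))
C = rng.normal(size=(d, r))


def full_verb_tensor(A, B, C):
    """Rebuild the dense order-3 verb tensor V[i, s, j] from its CP factors."""
    return np.einsum("ir,sr,jr->isj", A, B, C)


def sentence_full(subj, verb, obj):
    """Contract a dense verb tensor with its subject and object vectors."""
    return np.einsum("i,isj,j->s", subj, verb, obj)


def sentence_factorized(subj, A, B, C, obj):
    """Same contraction, computed without materializing the dense tensor."""
    return B @ ((subj @ A) * (obj @ C))


dense = sentence_full(dog, full_verb_tensor(A, B, C), cat)
cheap = sentence_factorized(dog, A, B, C, cat)
swapped = sentence_factorized(cat, A, B, C, dog)

full_params = d ** 3            # dense verb tensor
factorized_params = 3 * d * r   # three CP factor matrices
print(full_params, factorized_params)  # 4096 192
```

In this sketch the factorized contraction reproduces the dense one while storing 192 numbers per verb instead of 4096. Because the subject and object legs have separate factors, "dog chases cat" and "cat chases dog" map to different sentence vectors, which is the kind of word-order sensitivity the abstract reports.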

machine learning, natural language, tensor

Country:
- Europe (0.67)
- North America
  - United States (0.28)
  - Mexico (0.28)

Genre:
- Research Report (0.70)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language
    - Grammars & Parsing (1.00)
    - Text Processing (0.94)
  - Machine Learning > Neural Networks
    - Deep Learning (0.93)