Compositional Entailment Learning for Hyperbolic Vision-Language Models
Pal, Avik, van Spengler, Max, di Melendugno, Guido Maria D'Amely, Flaborea, Alessandro, Galasso, Fabio, Mettes, Pascal
–arXiv.org Artificial Intelligence
Image-text representation learning forms a cornerstone in vision-language models, where pairs of images and textual descriptions are contrastively aligned in a shared embedding space. Since visual and textual concepts are naturally hierarchical, recent work has shown that hyperbolic space can serve as a high-potential manifold to learn vision-language representation with strong downstream performance. In this work, for the first time we show how to fully leverage the innate hierarchical nature of hyperbolic embeddings by looking beyond individual image-text pairs. We propose Compositional Entailment Learning for hyperbolic vision-language models. The idea is that an image is not only described by a sentence but is itself a composition of multiple object boxes, each with their own textual description. Such information can be obtained freely by extracting nouns from sentences and using openly available localized grounding models. We show how to hierarchically organize images, image boxes, and their textual descriptions through contrastive and entailment-based objectives. Empirical evaluation on a hyperbolic visionlanguage model trained with millions of image-text pairs shows that the proposed compositional learning approach outperforms conventional Euclidean CLIP learning, as well as recent hyperbolic alternatives, with better zero-shot and retrieval generalization and clearly stronger hierarchical performance. Vision-language modeling has witnessed rapid progress in recent years with innovative approaches such as CLIP (Radford et al., 2021) and ALIGN (Jia et al., 2021) using extensive vision-language data to train encoders for understanding visual and textual content simultaneously. Such encoders align visual scenes with textual descriptions in a shared high-dimensional Euclidean space, facilitating semantic understanding (Radford et al., 2021). While effective, conventional vision-language models only take a holistic approach to image-text representation learning, neglecting the intrinsic hierarchy and composition of elements within images. Indeed, a visual scene is commonly composed of multiple objects interacting with one another to form a precise context.
arXiv.org Artificial Intelligence
Oct-9-2024
- Country:
- South America > Chile
- Oceania > Australia
- North America
- United States
- New Jersey (0.04)
- Washington > King County
- Seattle (0.04)
- Massachusetts > Suffolk County
- Boston (0.04)
- Louisiana > Orleans Parish
- New Orleans (0.05)
- Hawaii > Honolulu County
- Honolulu (0.04)
- Florida > Miami-Dade County
- Miami (0.04)
- California > Los Angeles County
- Long Beach (0.04)
- Puerto Rico > San Juan
- San Juan (0.04)
- Canada
- Quebec > Montreal (0.04)
- British Columbia > Metro Vancouver Regional District
- Vancouver (0.04)
- Alberta > Census Division No. 15
- Improvement District No. 9 > Banff (0.04)
- United States
- Europe
- Austria (0.04)
- Sweden > Stockholm
- Stockholm (0.04)
- Switzerland > Zürich
- Zürich (0.14)
- Germany > Bavaria
- Upper Bavaria > Munich (0.04)
- Italy > Tuscany
- Florence (0.04)
- Netherlands
- North Holland > Amsterdam (0.04)
- North Brabant > Eindhoven (0.04)
- Denmark > Capital Region
- Copenhagen (0.04)
- United Kingdom > England
- Greater London > London (0.04)
- France > Île-de-France
- Slovenia > Coastal-Karst
- Municipality of Koper > Koper (0.04)
- Asia > Middle East
- Africa > Rwanda
- Genre:
- Research Report > Promising Solution (0.34)
- Overview > Innovation (0.34)
- Technology:
- Information Technology > Artificial Intelligence
- Vision (1.00)
- Natural Language (1.00)
- Machine Learning
- Statistical Learning (0.68)
- Neural Networks > Deep Learning (0.46)
- Information Technology > Artificial Intelligence