Don't Learn, Ground: A Case for Natural Language Inference with Visual Grounding
Daniil Ignatev, Ayman Santeer, Albert Gatt, Denis Paperno
–arXiv.org Artificial Intelligence
We propose a zero-shot method for Natural Language Inference (NLI) that leverages multimodal representations by grounding language in visual contexts. Our approach generates visual representations of premises using text-to-image models and performs inference by comparing these representations with textual hypotheses. We evaluate two inference techniques: cosine similarity and visual question answering. Our method achieves high accuracy without task-specific fine-tuning, demonstrating robustness against textual biases and surface heuristics. Additionally, we design a controlled adversarial dataset to validate the robustness of our approach. Our findings suggest that leveraging visual modality as a meaning representation provides a promising direction for robust natural language understanding.
Nov-24-2025
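The cosine-similarity variant of the inference step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the embeddings below are placeholder vectors standing in for (1) the embedding of an image generated from the premise by a text-to-image model and (2) the embedding of the hypothesis from a shared multimodal encoder (e.g. a CLIP-style model); the threshold value is likewise illustrative.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def infer_label(premise_image_emb, hypothesis_text_emb, threshold=0.5):
    # Zero-shot decision rule sketched from the abstract: high similarity
    # between the premise's visual representation and the hypothesis text
    # suggests entailment; low similarity suggests no entailment.
    # The threshold here is a hypothetical choice, not from the paper.
    sim = cosine_similarity(premise_image_emb, hypothesis_text_emb)
    return "entailment" if sim >= threshold else "not-entailment"

# Placeholder embeddings; a real system would obtain these from a
# text-to-image pipeline plus a multimodal encoder.
premise_emb = np.array([0.9, 0.1, 0.3])
hypothesis_emb = np.array([0.8, 0.2, 0.4])
print(infer_label(premise_emb, hypothesis_emb))  # → entailment
```

Because no task-specific fine-tuning is involved, the only learned components are the pretrained generative and embedding models; the inference rule itself is a fixed similarity comparison.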