Don't Learn, Ground: A Case for Natural Language Inference with Visual Grounding
Daniil Ignatev, Ayman Santeer, Albert Gatt, Denis Paperno
arXiv.org Artificial Intelligence
We propose a zero-shot method for Natural Language Inference (NLI) that leverages multimodal representations by grounding language in visual contexts. Our approach generates visual representations of premises using text-to-image models and performs inference by comparing these representations with textual hypotheses. We evaluate two inference techniques: cosine similarity and visual question answering. Our method achieves high accuracy without task-specific fine-tuning, demonstrating robustness against textual biases and surface heuristics. Additionally, we design a controlled adversarial dataset to validate the robustness of our approach. Our findings suggest that leveraging visual modality as a meaning representation provides a promising direction for robust natural language understanding.
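The cosine-similarity variant described in the abstract can be sketched roughly as follows. This is an illustrative outline only, not the authors' implementation: `embed_image` and `embed_text` stand in for a shared multimodal embedding space (e.g. a CLIP-style encoder), and the `threshold` value is an assumed placeholder, not a figure from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nli_by_grounding(premise_image_vec, hypothesis_text_vec, threshold=0.5):
    """Zero-shot NLI decision via visual grounding (sketch).

    premise_image_vec: embedding of an image generated from the premise
                       by a text-to-image model (assumed precomputed).
    hypothesis_text_vec: embedding of the hypothesis in the same space.
    threshold: hypothetical decision boundary; in practice this would
               be chosen on a validation set or set heuristically.
    """
    sim = cosine(premise_image_vec, hypothesis_text_vec)
    return "entailment" if sim >= threshold else "non-entailment"
```

With real CLIP-style embeddings, a hypothesis entailed by the premise should land close to the premise's generated image in the shared space (e.g. `nli_by_grounding(img_vec, hyp_vec)` returning `"entailment"` when the vectors nearly align), while contradictory or unrelated hypotheses fall below the threshold.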
Nov 24, 2025