Don't Learn, Ground: A Case for Natural Language Inference with Visual Grounding
Daniil Ignatev, Ayman Santeer, Albert Gatt, Denis Paperno
–arXiv.org Artificial Intelligence
We propose a zero-shot method for Natural Language Inference (NLI) that leverages multimodal representations by grounding language in visual contexts. Our approach generates visual representations of premises using text-to-image models and performs inference by comparing these representations with textual hypotheses. We evaluate two inference techniques: cosine similarity and visual question answering. Our method achieves high accuracy without task-specific fine-tuning, demonstrating robustness against textual biases and surface heuristics. Additionally, we design a controlled adversarial dataset to validate the robustness of our approach. Our findings suggest that leveraging visual modality as a meaning representation provides a promising direction for robust natural language understanding.
Nov-24-2025
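The cosine-similarity variant of the inference step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the embeddings below are placeholder vectors standing in for (1) the embedding of an image generated from the premise by a text-to-image model and (2) the embedding of the hypothesis from a shared multimodal encoder (e.g. a CLIP-style model); the threshold value is likewise illustrative.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def infer_label(premise_image_emb, hypothesis_text_emb, threshold=0.5):
    # Zero-shot decision rule sketched from the abstract: high similarity
    # between the premise's visual representation and the hypothesis text
    # suggests entailment; low similarity suggests no entailment.
    # The threshold here is a hypothetical choice, not from the paper.
    sim = cosine_similarity(premise_image_emb, hypothesis_text_emb)
    return "entailment" if sim >= threshold else "not-entailment"

# Placeholder embeddings; a real system would obtain these from a
# text-to-image pipeline plus a multimodal encoder.
premise_emb = np.array([0.9, 0.1, 0.3])
hypothesis_emb = np.array([0.8, 0.2, 0.4])
print(infer_label(premise_emb, hypothesis_emb))  # → entailment
```

Because no task-specific fine-tuning is involved, the only learned components are the pretrained generative and embedding models; the inference rule itself is a fixed similarity comparison.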