Measuring How (Not Just Whether) VLMs Build Common Ground
Saki Imai, Mert İnan, Anthony Sicilia, Malihe Alikhani
arXiv.org Artificial Intelligence
Large vision-language models (VLMs) increasingly claim reasoning skills, yet current benchmarks evaluate them in single-turn or question-answering settings. Grounding, however, is an interactive process in which people gradually develop shared understanding through ongoing communication. We introduce a four-metric suite (grounding efficiency, content alignment, lexical adaptation, and human-likeness) to systematically evaluate VLM performance in interactive grounding contexts. We deploy the suite on 150 self-play sessions of interactive referential games between three proprietary VLMs and compare them with human dyads. All three models diverge from human patterns on at least three metrics, with GPT-4o-mini the closest overall. We find that (i) task-success scores do not indicate successful grounding and (ii) high image-utterance alignment does not necessarily predict task success. Our metric suite and findings offer a framework for future research on VLM grounding.
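The abstract names "lexical adaptation" as one of the four metrics but does not define it here. As a purely illustrative sketch (the definitions below are my assumptions, not the paper's), lexical adaptation in referential games is often proxied by how much the two players' vocabularies converge across rounds, e.g. via Jaccard overlap of utterance word sets:

```python
# Hypothetical sketch of a lexical-adaptation proxy for a referential game.
# All function names and the overlap-based definition are illustrative
# assumptions; the paper's actual metric may differ.

def vocab(utterance: str) -> set:
    """Lowercase bag-of-words vocabulary of a single utterance."""
    return set(utterance.lower().split())

def lexical_overlap(a: str, b: str) -> float:
    """Jaccard overlap between the vocabularies of two utterances."""
    va, vb = vocab(a), vocab(b)
    return len(va & vb) / len(va | vb) if va | vb else 0.0

def adaptation_curve(rounds):
    """Per-round overlap between director and matcher utterances.

    An upward trend would suggest the pair is converging on shared
    referring expressions as the game proceeds.
    """
    return [lexical_overlap(director, matcher) for director, matcher in rounds]

# Toy dialogue: the pair converges on the shorthand "the dancer".
rounds = [
    ("the tall angular figure with arms out", "the one like a dancer maybe"),
    ("yes the dancer figure again", "ok the dancer"),
    ("the dancer", "the dancer"),
]
curve = adaptation_curve(rounds)
```

Under this toy definition, a human-like dyad would show a rising curve as players settle on shared shorthand, which is one way the paper's comparison between VLM self-play and human dyads could be operationalized.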
Sep-5-2025