Teaching Vision-Language Models to Ask: Resolving Ambiguity in Visual Questions
Pu Jian, Donglei Yu, Wen Yang, Shuo Ren, Jiajun Zhang
arXiv.org Artificial Intelligence
In the context of visual question answering (VQA), users often pose ambiguous questions to vision-language models (VLMs) due to varying expression habits. Existing research addresses such ambiguities primarily by rephrasing questions. These approaches neglect the inherently interactive nature of user interactions with VLMs, where ambiguities can be clarified through user feedback. However, research on interactive clarification faces two major challenges: (1) no benchmark exists to assess VLMs' capacity for resolving ambiguities through interaction; (2) VLMs are trained to prefer answering over asking, which prevents them from seeking clarification. To overcome these challenges, we introduce the ClearVQA benchmark, which targets three common categories of ambiguity in the VQA context and encompasses a variety of VQA scenarios.
Sep-17-2025