Right this way: Can VLMs Guide Us to See More to Answer Questions?

May-27-2025, 20:56:55 GMT–Neural Information Processing Systems

In question-answering scenarios, humans can assess whether the available information is sufficient and seek additional information if necessary, rather than providing a forced answer. In contrast, Vision Language Models (VLMs) typically generate direct, one-shot responses without evaluating the sufficiency of the information. To investigate this gap, we identify a critical and challenging task in the Visual Question Answering (VQA) scenario: can VLMs indicate how to adjust an image when the visual information is insufficient to answer a question? This capability is especially valuable for assisting visually impaired individuals who often need guidance to capture images correctly. To evaluate this capability of current VLMs, we introduce a human-labeled dataset as a benchmark for this task.

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language (0.88)
  - Machine Learning (0.62)