Trust but Verify: Programmatic VLM Evaluation in the Wild
Viraj Prabhu, Senthil Purushwalkam, An Yan, Caiming Xiong, Ran Xu
Vision-Language Models (VLMs) often generate plausible but incorrect responses to visual queries. However, reliably quantifying the effect of such hallucinations in free-form responses to open-ended queries is challenging, as it requires visually verifying each claim within the response. We propose Programmatic VLM Evaluation (PROVE), a new benchmarking paradigm for evaluating VLM responses to open-ended queries. To construct PROVE, we provide a large language model (LLM) with a high-fidelity scene-graph representation constructed from a hyper-detailed image caption, and prompt it to generate diverse question-answer (QA) pairs, as well as programs that can be executed over the scene-graph object to verify each QA pair. We thus construct a benchmark of 10.5k challenging but visually grounded QA pairs. Next, to evaluate free-form model responses to queries in PROVE, we propose a programmatic evaluation strategy that measures both the helpfulness and truthfulness of a response within a unified scene graph-based framework. We benchmark the helpfulness-truthfulness trade-offs of a range of VLMs on PROVE, finding that very few are in fact able to achieve a good balance between the two.

Vision-language models (VLMs) have emerged as an effective solution for generating responses to queries about visual content. This has led to a flurry of research on reliably benchmarking VLM performance (Liu et al., 2024a), measuring not just the helpfulness but also the truthfulness of their responses. Existing benchmarks fall into two categories: discriminative benchmarks (Hu et al., 2023; Lovenia et al., 2023; Li et al., 2023), which evaluate a model's responses to close-ended, existence-based queries ("Is there a man in this image?"), and generative benchmarks, which evaluate its free-form responses to open-ended queries. While discriminative benchmarks ease evaluation, they do not realistically simulate in-the-wild usage.
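To make the verification idea concrete, a program of the kind described above can be thought of as a small function that checks one QA pair against the scene graph. The following is a minimal, hypothetical Python sketch; the SceneGraph and Entity classes, their method names, and the example program are assumptions made for illustration, not the paper's released interface.

```python
# Hypothetical sketch of PROVE-style programmatic verification.
# All names here are illustrative assumptions, not the authors' code.
from dataclasses import dataclass, field


@dataclass
class Entity:
    """A scene-graph node: an object with attributes and outgoing relations."""
    name: str
    attributes: set[str] = field(default_factory=set)
    relations: dict[str, set[str]] = field(default_factory=dict)  # relation -> target names


@dataclass
class SceneGraph:
    """Scene graph built from a hyper-detailed image caption."""
    entities: dict[str, Entity]

    def exists(self, name: str) -> bool:
        return name in self.entities

    def has_attribute(self, name: str, attribute: str) -> bool:
        return self.exists(name) and attribute in self.entities[name].attributes

    def related(self, subject: str, relation: str, obj: str) -> bool:
        return (self.exists(subject)
                and obj in self.entities[subject].relations.get(relation, set()))


def verify_qa(graph: SceneGraph) -> bool:
    """Example LLM-generated verification program for one QA pair,
    e.g. Q: 'What is the man holding?'  A: 'A red umbrella.'"""
    return (graph.exists("man")
            and graph.related("man", "holding", "umbrella")
            and graph.has_attribute("umbrella", "red"))


if __name__ == "__main__":
    graph = SceneGraph(entities={
        "man": Entity("man", attributes={"standing"},
                      relations={"holding": {"umbrella"}}),
        "umbrella": Entity("umbrella", attributes={"red"}),
    })
    # A QA pair would be kept in the benchmark only if its program returns True.
    print(verify_qa(graph))  # True
```

Under the same assumptions, claims extracted from a model's free-form response could be checked against the scene graph in a similar way, which is one plausible reading of how a unified scene graph-based framework can score both helpfulness and truthfulness.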
arXiv.org Artificial Intelligence
Oct-16-2024