NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples

Mar-18-2026, 20:02:49 GMT–Neural Information Processing Systems

Vision-language models (VLMs) have made significant progress in recent visual-question-answering (VQA) benchmarks that evaluate complex visio-linguistic reasoning. However, are these models truly effective? In this work, we show that VLMs still struggle with natural images and questions that humans can easily answer, which we term natural adversarial samples. We also find it surprisingly easy to generate these VQA samples from natural image-text corpora using off-the-shelf models like CLIP and ChatGPT. We propose a semi-automated approach to collect a new benchmark, NaturalBench, for reliably evaluating VLMs with 10,000 human-verified VQA samples.

large language model, machine learning, natural language

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language
    - Large Language Model (0.54)
    - Chatbot (0.54)
  - Machine Learning > Neural Networks
    - Deep Learning (0.54)