IllusionVQA: A Challenging Optical Illusion Dataset for Vision Language Models

Shahgir, Haz Sameen, Sayeed, Khondker Salman, Bhattacharjee, Abhik, Ahmad, Wasi Uddin, Dong, Yue, Shahriyar, Rifat

Mar-30-2024, arXiv.org Artificial Intelligence

The advent of Vision Language Models (VLMs) has allowed researchers to investigate the visual understanding of a neural network using natural language. Beyond object classification and detection, VLMs are capable of visual comprehension and common-sense reasoning. This naturally leads to the question: how do VLMs respond when the image itself is inherently unreasonable? To this end, we present IllusionVQA, a diverse dataset of challenging optical illusions and hard-to-interpret scenes that tests the capability of VLMs in two distinct multiple-choice VQA tasks: comprehension and soft localization. GPT4V, the best-performing VLM, achieves 62.99% accuracy (4-shot) on the comprehension task and 49.7% on the localization task (4-shot and Chain-of-Thought). Human evaluation reveals that humans achieve 91.03% and 100% accuracy on comprehension and localization, respectively. We discover that In-Context Learning (ICL) and Chain-of-Thought reasoning substantially degrade the performance of GeminiPro on the localization task. Tangentially, we discover a potential weakness in the ICL capabilities of VLMs: they fail to locate optical illusions even when the correct answer is in the context window as a few-shot example.

illusion, optical illusion, vlm

Country:
- North America > United States
  - California
    - Los Angeles County > Los Angeles (0.14)
    - Riverside County > Riverside (0.04)
- Asia
  - Bangladesh (0.04)
  - Japan > Shikoku
    - Kagawa Prefecture > Takamatsu (0.04)

Genre:
- Research Report (0.50)

Industry:
- Health & Medicine > Therapeutic Area > Neurology (0.47)

Technology:
- Information Technology > Artificial Intelligence
  - Vision (1.00)
  - Representation & Reasoning (1.00)
  - Natural Language > Large Language Model (0.94)
  - Machine Learning > Neural Networks
    - Deep Learning (0.94)
