OWLViz: An Open-World Benchmark for Visual Question Answering

Nguyen, Thuy; Nguyen, Dang; Nguyen, Hoang; Luong, Thuan; Dang, Long Hoang; Lai, Viet Dac

Mar-4-2025–arXiv.org Artificial Intelligence

We present a challenging benchmark for the Open WorLd VISual question answering (OWLViz) task. OWLViz poses concise, unambiguous queries that require integrating multiple capabilities, including visual understanding, web exploration, and specialized tool usage. While humans achieve 69.2% accuracy on these intuitive tasks, even state-of-the-art vision-language models (VLMs) struggle: the best model, Gemini 2.0, achieves only 26.6% accuracy. Current agentic VLMs, which rely on limited vision and vision-language models as tools, perform even worse. This performance gap reveals significant limitations in multimodal systems' ability to select appropriate tools and execute complex reasoning sequences, establishing new directions for advancing practical AI research.

full front cover, reasoning, second floor


Country:
- North America > United States
  - Maryland (0.04)
  - Washington > King County
    - Seattle (0.04)
  - Oregon > Lane County
    - Eugene (0.15)
  - Florida > Miami-Dade County
    - Miami (0.04)
- Europe > Austria
  - Vienna (0.14)
- Asia
  - Thailand > Bangkok
    - Bangkok (0.04)
  - China > Sichuan Province
    - Chengdu (0.04)

Genre:
- Research Report (0.82)

Industry:
- Consumer Products & Services (0.47)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language
    - Large Language Model (1.00)
    - Chatbot (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (0.90)
