Caption This, Reason That: VLMs Caught in the Middle

Jun-16-2026, 10:06:02 GMT–Neural Information Processing Systems

Vision-Language Models (VLMs) have shown remarkable progress in visual understanding in recent years. Yet, they still lag behind human capabilities in specific visual tasks such as counting or relational reasoning. To understand the underlying limitations, we adopt methodologies from cognitive science, analyzing VLM performance along core cognitive axes: Perception, Attention, and Memory. Using a suite of tasks targeting these abilities, we evaluate state-of-the-art VLMs, including GPT-4o. Our analysis reveals distinct cognitive profiles: while advanced models approach ceiling performance on some tasks (e.g.

large language model, machine learning, natural language

Country:
- North America > Canada (0.28)

Genre:
- Research Report
  - New Finding (1.00)
  - Experimental Study (1.00)

Industry:
- Health & Medicine (0.48)

Technology:
- Information Technology > Artificial Intelligence
  - Vision (1.00)
  - Cognitive Science (1.00)
  - Natural Language
    - Large Language Model (1.00)
    - Chatbot (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)
