Caption This, Reason That: VLMs Caught in the Middle
Neural Information Processing Systems
Vision-Language Models (VLMs) have shown remarkable progress in visual understanding in recent years. Yet, they still lag behind human capabilities in specific visual tasks such as counting or relational reasoning. To understand the underlying limitations, we adopt methodologies from cognitive science, analyzing VLM performance along core cognitive axes: Perception, Attention, and Memory. Using a suite of tasks targeting these abilities, we evaluate state-of-the-art VLMs, including GPT-4o. Our analysis reveals distinct cognitive profiles: while advanced models approach ceiling performance on some tasks (e.g.
Jun-16-2026, 10:06:02 GMT
- Country:
- North America > Canada (0.28)
- Genre:
- Research Report
- New Finding (1.00)
- Experimental Study (1.00)
- Industry:
- Health & Medicine (0.48)
- Technology:
- Information Technology > Artificial Intelligence
- Vision (1.00)
- Cognitive Science (1.00)
- Natural Language
- Large Language Model (1.00)
- Chatbot (1.00)
- Machine Learning > Neural Networks
- Deep Learning (1.00)