What is the Visual Cognition Gap between Humans and Multimodal LLMs?

Open in new window