Collaborating Authors

 Giallanza, Tyler


Understanding the Limits of Vision Language Models Through the Lens of the Binding Problem

arXiv.org Artificial Intelligence

Recent work has documented striking heterogeneity in the performance of state-of-the-art vision language models (VLMs), including both multimodal language models and text-to-image models. These models are able to describe and generate a diverse array of complex, naturalistic images, yet they exhibit surprising failures on basic multi-object reasoning tasks -- such as counting, localization, and simple forms of visual analogy -- that humans perform with near perfect accuracy. To better understand this puzzling pattern of successes and failures, we turn to theoretical accounts of the binding problem in cognitive science and neuroscience, a fundamental problem that arises when a shared set of representational resources must be used to represent distinct entities (e.g., to represent multiple objects in an image), necessitating the use of serial processing to avoid interference. We find that many of the puzzling failures of state-of-the-art VLMs can be explained as arising due to the binding problem, and that these failure modes are strikingly similar to the limitations exhibited by rapid, feedforward processing in the human brain.


Human-Like Geometric Abstraction in Large Pre-trained Neural Networks

arXiv.org Artificial Intelligence

... that can capture regularities in the external world. By forming abstractions that can generalize to future experience, humans are able to exhibit efficient learning and strong generalization across domains (Lake, Salakhutdinov, & Tenenbaum, 2015; Hull, 1920). One domain in which this has been observed by cognitive scientists is geometric reasoning (Dehaene, Al Roumi, Lakretz, Planton, & Sablé-Meyer, 2022), where people consistently extract abstract concepts, such as parallelism, symmetry, and convexity, that generalize across many visual instances.

Specifically, we apply neural network models to behavioral tasks from recent empirical work (Sablé-Meyer et al., 2021, 2022; Hsu, Wu, & Goodman, 2022) that catalogue three effects indicative of abstraction in human geometric reasoning. First, humans are sensitive to geometric complexity, such that they are slower to recall complex images as compared to simpler ones (Sablé-Meyer et al., 2022). Second, humans are sensitive to geometric regularity (based on features such as right angles, parallel sides, and symmetry) such that they are able to classify regular ...