Does Object Binding Naturally Emerge in Large Pretrained Vision Transformers?

Jun-23-2026, 16:02:33 GMT–Neural Information Processing Systems

Object binding, the brain's ability to bind the many features that collectively represent an object into a coherent whole, is central to human cognition. It groups low-level perceptual features into high-level object representations, stores those objects efficiently and compositionally in memory, and supports human reasoning about individual object instances. While prior work often imposes object-centric attention (e.g., Slot Attention) explicitly to probe these benefits, it remains unclear whether this ability naturally emerges in pre-trained Vision Transformers (ViTs). Intuitively, it could: recognizing which patches belong to the same object should be useful for downstream prediction and should therefore guide attention. Motivated by the quadratic nature of self-attention, we hypothesize that ViTs represent whether two patches belong to the same object, a property we term .


