Understanding the Limits of Vision Language Models Through the Lens of the Binding Problem

Open in new window