UniBench: Visual Reasoning Requires Rethinking Vision-Language Beyond Scaling

Neural Information Processing Systems

We find that while scaling training data or model size can boost many vision-language model capabilities, scaling offers little benefit for reasoning or relations. Surprisingly, we also discover that today's best VLMs struggle on simple digit recognition and counting tasks, e.g., MNIST, which much simpler networks can solve.