UniBench: Visual Reasoning Requires Rethinking Vision-Language Beyond Scaling

Neural Information Processing Systems 

We find that while scaling training data or model size can boost many vision-language model capabilities, scaling offers little benefit for reasoning or relations. Surprisingly, we also discover today's best VLMs struggle on simple digit recognition and counting tasks, e.g. MNIST, which much simpler networks can solve.
