Large Language Model
UniBench: Visual Reasoning Requires Rethinking Vision-Language Beyond Scaling
We find that while scaling training data or model size can boost many vision-language model capabilities, scaling offers little benefit for reasoning or relations. Surprisingly, we also discover that today's best VLMs struggle on simple digit recognition and counting tasks, e.g., MNIST, which much simpler networks can solve.