UniBench: Visual Reasoning Requires Rethinking Vision-Language Beyond Scaling
–Neural Information Processing Systems
Significant research efforts have been made to scale and improve vision-language model (VLM) training approaches. Yet, with an ever-growing number of benchmarks, researchers are tasked with the heavy burden of implementing each protocol, bearing a non-trivial computational cost, and making sense of how all these benchmarks translate into meaningful axes of progress. To facilitate a systematic evaluation of VLM progress, we introduce UniBench: a unified implementation of 50+ VLM benchmarks spanning a range of carefully categorized vision-centric capabilities from object recognition to spatial awareness, counting, and much more. We showcase the utility of UniBench for measuring progress by evaluating nearly 60 publicly available vision-language models, trained on scales of up to 12.8B samples. We find that while scaling training data or model size can boost many vision-language model capabilities, scaling offers little benefit for reasoning or relations.
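The abstract describes UniBench as a single harness that runs many benchmarks and groups results by vision-centric capability. As an illustration only, the following is a minimal sketch of such a unified evaluation loop; the class, function, and capability names are hypothetical placeholders and do not reflect the actual UniBench API.

```python
# Hypothetical sketch of a unified VLM evaluation loop, not the UniBench API.
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Benchmark:
    name: str
    capability: str  # e.g. "object recognition", "spatial awareness", "counting", "relations"
    evaluate: Callable[[object], float]  # returns a score (e.g. accuracy) for a given model


def run_suite(model, benchmarks: List[Benchmark]) -> Dict[str, float]:
    """Evaluate one model on every benchmark and aggregate scores per capability."""
    per_capability: Dict[str, List[float]] = {}
    for bench in benchmarks:
        score = bench.evaluate(model)
        per_capability.setdefault(bench.capability, []).append(score)
    # Average within each capability axis so progress can be compared per axis
    # (recognition vs. counting vs. relations) rather than as one blended number.
    return {cap: sum(scores) / len(scores) for cap, scores in per_capability.items()}


# Usage (hypothetical): results = run_suite(my_vlm, all_benchmarks)
```

Reporting per-capability averages rather than a single pooled score is what lets the paper separate capabilities that improve with scale from those, such as reasoning and relations, that do not.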