UniBench: Visual Reasoning Requires Rethinking Vision-Language Beyond Scaling

May-31-2025, 12:58:02 GMT–Neural Information Processing Systems

Significant research efforts have been made to scale and improve vision-language model (VLM) training approaches. Yet, with an ever-growing number of benchmarks, researchers are tasked with the heavy burden of implementing each protocol, bearing a non-trivial computational cost, and making sense of how all these benchmarks translate into meaningful axes of progress. To facilitate a systematic evaluation of VLM progress, we introduce UniBench: a unified implementation of 50+ VLM benchmarks spanning a range of carefully categorized vision-centric capabilities from object recognition to spatial awareness, counting, and much more. We showcase the utility of UniBench for measuring progress by evaluating nearly 60 publicly available vision-language models, trained at scales of up to 12.8B samples. We find that while scaling training data or model size can boost many vision-language model capabilities, scaling offers little benefit for reasoning or relations.

benchmark, large language model, machine learning, (17 more...)

Country:
- Europe > Spain (0.14)
- North America > United States (0.14)

Genre:
- Research Report > New Finding (0.46)

Industry:
- Health & Medicine
  - Diagnostic Medicine (0.46)
  - Therapeutic Area (0.46)
- Information Technology (0.68)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning > Neural Networks
    - Deep Learning (0.68)
  - Natural Language > Large Language Model (0.98)
  - Representation & Reasoning (1.00)
  - Vision (1.00)