Roboflow100-VL: A Multi-Domain Object Detection Benchmark for Vision-Language Models

Jun-10-2026, 09:17:47 GMT – Neural Information Processing Systems

Vision-language models (VLMs) trained on internet-scale data achieve remarkable zero-shot detection performance on common objects like car, truck, and pedestrian. However, state-of-the-art models still struggle to generalize to out-of-distribution classes, tasks and imaging modalities not typically found in their pre-training. Rather than simply re-training VLMs on more visual data, we argue that one should align VLMs to new concepts with annotation instructions containing a few visual examples and rich textual descriptions. To this end, we introduce Roboflow100-VL, a large-scale collection of 100 multi-modal object detection datasets with diverse concepts not commonly found in VLM pre-training. We evaluate state-of-the-art models on our benchmark in zero-shot, few-shot, semi-supervised, and fully-supervised settings, allowing for comparison across data regimes.

large language model, machine learning, natural language

Technology:
- Information Technology > Artificial Intelligence
  - Vision (0.82)
  - Machine Learning (0.62)
  - Natural Language > Large Language Model (0.53)