Revisiting Few-Shot Object Detection with Vision-Language Models
Neural Information Processing Systems
The era of vision-language models (VLMs) trained on web-scale datasets challenges conventional formulations of "open-world" perception. In this work, we revisit the task of few-shot object detection (FSOD) in the context of recent foundational VLMs. First, we point out that zero-shot predictions from VLMs such as GroundingDINO significantly outperform state-of-the-art few-shot detectors (48 vs. 33 AP) on COCO. Despite their strong zero-shot performance, such foundation models may still be sub-optimal. For example, trucks on the web may be defined differently from trucks for a target application such as autonomous vehicle perception.
May-28-2025, 18:36:52 GMT
- Industry:
  - Automobiles & Trucks (0.46)
  - Transportation > Ground
    - Road (0.67)
- Technology:
  - Information Technology > Artificial Intelligence
    - Machine Learning > Neural Networks (0.94)
    - Natural Language
      - Chatbot (0.94)
      - Large Language Model (1.00)
    - Vision (1.00)