Bridging the Gap between Object and Image-level Representations for Open-Vocabulary Detection
Neural Information Processing Systems
Existing open-vocabulary object detectors typically enlarge their vocabulary sizes by leveraging different forms of weak supervision, which helps them generalize to novel objects at inference. Two popular forms of weak supervision used in open-vocabulary detection (OVD) are the pretrained CLIP model and image-level supervision. We note that neither mode of supervision is optimally aligned with the detection task: CLIP is trained with image-text pairs and lacks precise localization of objects, while image-level supervision has been used with heuristics that do not accurately specify local object regions. In this work, we propose to address this problem by performing object-centric alignment of the language embeddings from the CLIP model. Furthermore, we visually ground the objects with only image-level supervision using a pseudo-labeling process that provides high-quality object proposals and helps expand the vocabulary during training.
Apr-28-2026, 01:46:22 GMT
- Genre:
- Research Report > New Finding (0.68)
- Technology:
- Information Technology > Artificial Intelligence
- Vision (1.00)
- Natural Language (1.00)
- Machine Learning (1.00)