Bridging the Gap between Object and Image-level Representations for Open-Vocabulary Detection

Apr-28-2026, 01:46:22 GMT–Neural Information Processing Systems

Existing open-vocabulary object detectors typically enlarge their vocabulary sizes by leveraging different forms of weak supervision. This helps them generalize to novel objects at inference. Two popular forms of weak supervision used in open-vocabulary detection (OVD) are a pretrained CLIP model and image-level supervision. We note that neither of these modes of supervision is optimally aligned with the detection task: CLIP is trained with image-text pairs and lacks precise localization of objects, while image-level supervision has been used with heuristics that do not accurately specify local object regions. In this work, we propose to address this problem by performing object-centric alignment of the language embeddings from the CLIP model. Furthermore, we visually ground the objects with only image-level supervision using a pseudo-labeling process that provides high-quality object proposals and helps expand the vocabulary during training.
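To make the role of language embeddings concrete, the following is a minimal sketch of the generic open-vocabulary classification step that CLIP-based detectors commonly rely on: each region embedding is compared against one text embedding per class name, and the most similar class is chosen. It is not the method proposed in this work. The function names, the temperature value, and the three-dimensional toy vectors are all assumptions made for illustration; a real system would obtain region embeddings from a detector head and text embeddings from a CLIP text encoder.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def classify_regions(region_embs, text_embs, class_names, temperature=0.01):
    """Assign each region the class whose text embedding is most similar.

    Hypothetical helper: the temperature of 0.01 is an assumed value,
    not one taken from the abstract above.
    """
    text_unit = [l2_normalize(t) for t in text_embs]
    results = []
    for region in region_embs:
        region_unit = l2_normalize(region)
        # Cosine similarity between this region and every class-name embedding.
        sims = [sum(a * b for a, b in zip(region_unit, t)) for t in text_unit]
        # Temperature-scaled softmax, computed stably by subtracting the max logit.
        logits = [s / temperature for s in sims]
        max_logit = max(logits)
        exps = [math.exp(l - max_logit) for l in logits]
        total = sum(exps)
        probs = [e / total for e in exps]
        best = max(range(len(probs)), key=probs.__getitem__)
        results.append((class_names[best], probs[best]))
    return results


# Toy stand-ins for text embeddings of three class names (assumed values).
class_names = ["cat", "dog", "umbrella"]
text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# Toy stand-ins for embeddings of two region proposals (assumed values).
region_embs = [[0.9, 0.1, 0.0], [0.1, 0.2, 0.95]]

predictions = classify_regions(region_embs, text_embs, class_names)
```

Because the class list is only a set of text embeddings, extending the vocabulary at inference amounts to appending embeddings for new class names. This is also why the quality of alignment between region and text embeddings matters for the detection task.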

computer vision, machine learning, natural language, (16 more...)


Genre:
- Research Report > New Finding (0.68)

Technology:
- Information Technology > Artificial Intelligence
  - Vision (1.00)
  - Natural Language (1.00)
  - Machine Learning (1.00)
