Bridging the Gap between Object and Image-level Representations for Open-Vocabulary Detection

Neural Information Processing Systems 

Existing open-vocabulary object detectors typically enlarge their vocabulary sizes by leveraging different forms of weak supervision. This helps generalize to novel objects at inference. Two popular forms of weak-supervision used in openvocabulary detection (OVD) include pretrained CLIP model and image-level supervision. We note that both these modes of supervision are not optimally aligned for the detection task: CLIP is trained with image-text pairs and lacks precise localization of objects while the image-level supervision has been used with heuristics that do not accurately specify local object regions. In this work, we propose to address this problem by performing object-centric alignment of the language embeddings from the CLIP model. Furthermore, we visually ground the objects with only imagelevel supervision using a pseudo-labeling process that provides high-quality object proposals and helps expand the vocabulary during training.

Similar Docs  Excel Report  more

TitleSimilaritySource
None found