Bridging the Gap between Object and Image-level Representations for Open-Vocabulary Detection

Neural Information Processing Systems 

Two popular forms of weak supervision used in open-vocabulary detection (OVD) are a pretrained CLIP model and image-level supervision.