Highlighting What Matters: Promptable Embeddings for Attribute-Focused Image Retrieval
–Neural Information Processing Systems
While an image is worth more than a thousand words, only a few of them provide crucial information for a given task and thus should be focused on. In light of this, ideal text-to-image (T2I) retrievers should prioritize the specific visual attributes relevant to a query. To evaluate how current retrievers handle attribute-focused queries, we build COCO-Facet, a COCO-based benchmark with 9,112 queries about diverse attributes of interest. We find that CLIP-like retrievers, which are widely adopted due to their efficiency and zero-shot ability, have poor and imbalanced performance, possibly because their image embeddings focus on global semantics and subjects while leaving out other details. Notably, we reveal that even recent, stronger retrievers based on Multimodal Large Language Models (MLLMs), which have a larger output dimension, struggle with this limitation. Hence, we hypothesize that retrieving with general image embeddings is suboptimal for such queries. As a solution, we propose to use promptable image embeddings enabled by these multimodal retrievers, which boost performance by highlighting the required attributes.
Jun-13-2026, 17:14:39 GMT