Supplemental: A Benchmark for Compositional Text-to-image Retrieval
–Neural Information Processing Systems
GQA. GQA has annotations of objects and attributes in images. We use these annotations to construct queries such as "square white plate". We train on the GQA train split, with the unseen test queries and their corresponding images removed. This leaves around 67K training images and 27K queries.

CLEVR. On CLEVR, we test on 96 classes across 22,500 images.
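
As an illustration of the GQA setup described above, the following minimal sketch composes attribute-object queries from annotations and removes every training image that matches an unseen test query. The annotation format, the function names `build_queries` and `remove_unseen`, and the alphabetical attribute ordering are all assumptions made for this example; they are not taken from the benchmark's code.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical annotation format: image id -> list of (attributes, object name).
Annotations = Dict[str, List[Tuple[List[str], str]]]


def build_queries(annotations: Annotations) -> Dict[str, Set[str]]:
    """Map each composed query string to the set of image ids that contain it."""
    query_to_images: Dict[str, Set[str]] = {}
    for image_id, objects in annotations.items():
        for attributes, name in objects:
            if not attributes:
                continue  # objects without attributes yield no composed query
            # Assumption: attributes are ordered alphabetically to get one
            # canonical string per attribute set.
            query = " ".join(sorted(attributes) + [name])
            query_to_images.setdefault(query, set()).add(image_id)
    return query_to_images


def remove_unseen(train: Annotations, unseen_queries: Set[str]) -> Annotations:
    """Drop every training image that matches at least one unseen test query."""
    queries = build_queries(train)
    blocked: Set[str] = set()
    for query in unseen_queries:
        blocked |= queries.get(query, set())
    return {img: objs for img, objs in train.items() if img not in blocked}


# Toy data, invented for this example.
train = {
    "img1": [(["white", "square"], "plate")],
    "img2": [(["red"], "apple")],
    "img3": [(["white", "square"], "plate"), (["red"], "apple")],
}
unseen = {"square white plate"}

filtered = remove_unseen(train, unseen)
train_queries = build_queries(filtered)
```

In this toy case only `img2` survives the filter, so the remaining training queries contain "red apple" but not the held-out "square white plate".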
Oct-9-2025, 01:29:58 GMT