


ed519dacc89b2bead3f453b0b05a4a8b-Supplemental.pdf

Neural Information Processing Systems

Figure 11: Comparison of HCAM (labeled as HTM) with different chunk sizes to TrXL across the different ballet levels. The performance of the HCAM model is robust to varying chunk size, indicating that HCAM does not need a task-relevant segmentation to perform well.








Supplemental: A Benchmark for Compositional Text-to-image Retrieval

Neural Information Processing Systems

GQA. GQA has annotations of objects and attributes in images. We use these to construct queries like "square white plate". We train on the GQA train split (with the test-unseen queries and their corresponding images removed). Hence, we have around 67K training images and 27K queries.

CLEVR. On CLEVR, we test on 96 classes over 22,500 images.
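As a rough illustration of the GQA query construction described above, the following minimal sketch builds attribute-object queries from per-image annotations and drops training images that match a held-out query. The annotation layout (a list of objects with `name` and `attributes` fields), the function names `build_queries` and `filter_training_set`, the `max_attrs` limit, and the attribute ordering are all assumptions made for illustration; the source does not specify how queries are ordered or which attribute combinations are used.

```python
from itertools import combinations


def build_queries(objects, max_attrs=2):
    """Return the set of attribute-object queries for one image.

    Assumed layout: each object is a dict with a "name" string and an
    optional "attributes" list. Attribute order is taken as given.
    """
    queries = set()
    for obj in objects:
        attrs = obj.get("attributes", [])
        for k in range(1, max_attrs + 1):
            # combinations() preserves the input order of the attributes
            for combo in combinations(attrs, k):
                queries.add(" ".join(combo + (obj["name"],)))
    return queries


def filter_training_set(annotations, unseen_queries):
    """Keep only images whose queries do not overlap the held-out set.

    Hypothetical helper: mirrors removing the test-unseen queries and
    their corresponding images from the training split.
    """
    kept = {}
    for image_id, objects in annotations.items():
        queries = build_queries(objects)
        if queries & unseen_queries:
            continue  # image contains a held-out query, so drop it
        kept[image_id] = queries
    return kept


# Toy annotations, invented for this example only
annotations = {
    "img_1": [{"name": "plate", "attributes": ["square", "white"]}],
    "img_2": [{"name": "cup", "attributes": ["red"]}],
}
train = filter_training_set(annotations, unseen_queries={"red cup"})
```

Here `img_1` yields the queries "square plate", "white plate", and "square white plate", while `img_2` is removed because it matches the held-out query "red cup".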