e6d58fc68c0f3c36ae6e0e64478a69c0-Supplemental-Conference.pdf
Neural Information Processing Systems
It consists of an image encoder with a Vision Transformer [17] architecture, a text encoder with a similar Transformer architecture, and heads that predict bounding boxes and label scores from the provided images and text queries.
Input(s): An image and a list of free-text object descriptions (queries).
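The data flow described above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and is not taken from the source: the two encoders are replaced by precomputed embedding vectors, the label head is assumed to score each image token against each query with a sigmoid over a dot product, and the box head is a placeholder that squashes the first four token features into a normalized (cx, cy, w, h) box. The names `Detection`, `label_head`, `box_head`, and `detect` are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

Vector = List[float]


@dataclass
class Detection:
    box: Tuple[float, ...]  # assumed format: normalized (cx, cy, w, h)
    query_index: int        # index into the list of text queries
    score: float            # label score in [0, 1]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def label_head(image_tokens: List[Vector], query_embeddings: List[Vector]) -> List[List[float]]:
    # Assumption: one score per (image token, text query) pair.
    return [[sigmoid(dot(t, q)) for q in query_embeddings] for t in image_tokens]


def box_head(image_tokens: List[Vector]) -> List[Tuple[float, ...]]:
    # Placeholder for a learned head: maps the first four features to [0, 1].
    # Requires token vectors with at least four dimensions.
    return [tuple(sigmoid(v) for v in t[:4]) for t in image_tokens]


def detect(image_tokens: List[Vector], query_embeddings: List[Vector],
           threshold: float = 0.5) -> List[Detection]:
    scores = label_head(image_tokens, query_embeddings)
    boxes = box_head(image_tokens)
    detections = []
    for box, row in zip(boxes, scores):
        # Keep the best-matching query for each token if it clears the threshold.
        best = max(range(len(row)), key=row.__getitem__)
        if row[best] >= threshold:
            detections.append(Detection(box, best, row[best]))
    return detections
```

In a real system the `image_tokens` and `query_embeddings` would come from the image and text encoders. Here they are passed in directly so that the sketch stays self-contained.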
Apr-30-2026, 03:24:20 GMT