Appendix of GLIPv2: Unifying Localization and Vision-Language Understanding
–Neural Information Processing Systems
In Section 1, we provide more visualizations of our model's predictions on various localization and VL understanding tasks. In Section 2, we describe all evaluated tasks and their datasets in detail. In Section 8, we present a comparison of the models' inference speeds.

COCO has about 900k bounding box annotations for 80 object categories, with about 7.3 bounding box annotations per image. We evaluate using the any-box protocol specified in MDETR. LVIS uses the same images as COCO, re-annotated with more object categories.
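Under the any-box protocol, a predicted box for a phrase counts as correct if it overlaps any of that phrase's ground-truth boxes with IoU above a threshold (typically 0.5). A minimal sketch of this matching rule (the helper names `iou` and `any_box_hit` are our own illustration, not from the released code):

```python
def iou(a, b):
    # Boxes are (x1, y1, x2, y2) in pixel coordinates.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def any_box_hit(pred_box, gt_boxes, thresh=0.5):
    # Any-box protocol: the prediction is correct if it matches ANY
    # ground-truth box for the phrase with IoU >= thresh.
    return any(iou(pred_box, gt) >= thresh for gt in gt_boxes)
```

This contrasts with a merged-box protocol, where all ground-truth boxes for a phrase are first merged into one enclosing box before computing IoU.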