GLIPv2
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Object-Oriented Architecture (0.68)
GLIPv2: Unifying Localization and Vision-Language Understanding
Zhang, Haotian; Zhang, Pengchuan; Hu, Xiaowei; Chen, Yen-Chun; Li, Liunian Harold; Dai, Xiyang; Wang, Lijuan; Yuan, Lu; Hwang, Jenq-Neng; Gao, Jianfeng
We present GLIPv2, a grounded VL understanding model that serves both localization tasks (e.g., object detection, instance segmentation) and Vision-Language (VL) understanding tasks (e.g., VQA, image captioning). GLIPv2 elegantly unifies localization pre-training and Vision-Language Pre-training (VLP) with three pre-training tasks: phrase grounding, a VL reformulation of the detection task; region-word contrastive learning, a novel region-word-level contrastive learning task; and masked language modeling. This unification not only simplifies the previous multi-stage VLP procedure but also yields mutual benefits between localization and understanding tasks. Experimental results show that a single GLIPv2 model (with all model weights shared) achieves near-SoTA performance on various localization and understanding tasks. The model also shows (1) strong zero-shot and few-shot adaptation performance on open-vocabulary object detection tasks and (2) superior grounding capability on VL understanding tasks. Code will be released at https://github.com/microsoft/GLIP.
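All three pre-training objectives hinge on scoring image regions against words in the text prompt. As a reading aid, here is a minimal PyTorch-style sketch of GLIP-style region-word alignment scores; the function name, feature shapes, and temperature are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def region_word_alignment(region_feats, word_feats, temperature=0.07):
    """Score every image region against every word token.

    region_feats: (num_regions, d) visual region features
    word_feats:   (num_words, d) token features from the text encoder
    Returns a (num_regions, num_words) alignment matrix; in GLIP-style
    grounding, these scores stand in for a detector's classification
    logits and are supervised by the grounding and region-word
    contrastive objectives.
    """
    o = F.normalize(region_feats, dim=-1)  # unit-normalize regions
    w = F.normalize(word_feats, dim=-1)    # unit-normalize words
    return o @ w.t() / temperature

# Hypothetical shapes, for illustration only.
scores = region_word_alignment(torch.randn(100, 256), torch.randn(32, 256))
print(scores.shape)  # torch.Size([100, 32])
```

In the full model, replacing fixed-category classification logits with these text-conditioned scores is what lets one set of weights serve detection, grounding, and VL understanding alike.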
Appendix of GLIPv2: Unifying Localization and Vision-Language Understanding
In Section 1, we provide more visualizations of our model's predictions on various localization and VL understanding tasks. In Section 2, we describe all evaluated tasks and their datasets in detail. In Section 8, we compare the model's inference speed. COCO has about 900k bounding box annotations for 80 object categories, about 7.3 annotations per image. For grounding, we evaluate using the any-box protocol specified in MDETR. LVIS uses the same images as COCO, re-annotated with more object categories.
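For reference, here is a minimal sketch of how an any-box matching criterion can be scored: a phrase's top predicted box counts as correct if it overlaps any of that phrase's ground-truth boxes at IoU >= 0.5. This is an illustrative paraphrase of the protocol, not MDETR's evaluation code; the function names and data layout are assumptions.

```python
def box_iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def any_box_recall_at_1(predictions, ground_truths, iou_thresh=0.5):
    """Recall@1 under an any-box criterion: the top predicted box for a
    phrase is a hit if it overlaps ANY ground-truth box for that phrase
    with IoU >= iou_thresh.

    predictions:   {phrase_id: top_predicted_box}
    ground_truths: {phrase_id: [gt_box, ...]}
    """
    hits = sum(
        any(box_iou(pred, gt) >= iou_thresh for gt in ground_truths[pid])
        for pid, pred in predictions.items()
    )
    return hits / max(len(predictions), 1)

# Toy usage: the prediction overlaps the first ground-truth box enough.
preds = {"phrase_0": (10, 10, 50, 50)}
gts = {"phrase_0": [(12, 8, 48, 52), (100, 100, 120, 120)]}
print(any_box_recall_at_1(preds, gts))  # 1.0
```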