OW-VISCapTor: Abstractors for Open-World Video Instance Segmentation and Captioning

May-27-2025, 00:17:48 GMT–Neural Information Processing Systems

We propose a new task: open-world video instance segmentation and captioning. It requires detecting, segmenting, tracking, and describing with rich captions objects that have never been seen before. This challenging task can be addressed by developing "abstractors" that connect a vision model and a language foundation model. Concretely, we connect a multi-scale visual feature extractor and a large language model (LLM) by developing an object abstractor and an object-to-text abstractor. The object abstractor, consisting of a prompt encoder and transformer blocks, introduces spatially diverse open-world object queries to discover never-before-seen objects in videos.

artificial intelligence, large language model, natural language

Technology:
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)