OW-VISCap: Open-World Video Instance Segmentation and Captioning
Anwesa Choudhuri, Girish Chowdhary, Alexander G. Schwing
arXiv.org Artificial Intelligence
Open-world video instance segmentation is an important video understanding task. Yet most methods either operate in a closed-world setting, require additional user input, or rely on classic region-based proposals to identify never-before-seen objects. Further, these methods assign only a one-word label to detected objects and do not generate rich object-centric descriptions. They also often suffer from highly overlapping predictions. To address these issues, we propose Open-World Video Instance Segmentation and Captioning (OW-VISCap), an approach to jointly segment, track, and caption previously seen or unseen objects in a video. For this, we introduce open-world object queries to discover never-before-seen objects without additional user input. We generate rich and descriptive object-centric captions for each detected object via a masked-attention-augmented LLM input. We introduce an inter-query contrastive loss to ensure that the object queries differ from one another. Our generalized approach matches or surpasses the state of the art on three tasks: open-world video instance segmentation on the BURST dataset, dense video object captioning on the VidSTG dataset, and closed-world video instance segmentation on the OVIS dataset.
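The abstract does not spell out the exact form of the inter-query contrastive loss. As an illustration only, a minimal NumPy sketch of one plausible variant (the function name and the softplus-over-pairwise-cosine-similarities formulation are assumptions, not the paper's definition) might look like:

```python
import numpy as np

def inter_query_contrastive_loss(queries: np.ndarray) -> float:
    """Penalize similarity between distinct object queries.

    queries: (N, D) array of object query embeddings.
    Returns a scalar that is small when queries are mutually dissimilar.
    NOTE: this is a hypothetical sketch, not the loss used in OW-VISCap.
    """
    # L2-normalize each query so dot products become cosine similarities.
    norms = np.linalg.norm(queries, axis=1, keepdims=True)
    q = queries / np.clip(norms, 1e-8, None)
    sim = q @ q.T  # (N, N) cosine-similarity matrix
    n = sim.shape[0]
    # Keep only off-diagonal entries (similarity between distinct queries).
    off_diag = sim[~np.eye(n, dtype=bool)]
    # Softplus pushes pairwise similarities toward negative values.
    return float(np.mean(np.log1p(np.exp(off_diag))))
```

Under this sketch, a set of identical queries incurs a higher loss than a set of mutually orthogonal ones, which is the qualitative behavior the abstract describes: object queries are pushed to differ from one another.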
Apr-4-2024