SOC: Semantic-Assisted Object Cluster for Referring Video Object Segmentation

Jan-18-2025, 06:41:18 GMT–Neural Information Processing Systems

This paper studies referring video object segmentation (RVOS) by boosting video-level visual-linguistic alignment. Recent approaches model the RVOS task as a sequence prediction problem and perform multi-modal interaction as well as segmentation for each frame separately. However, the lack of a global view of video content leads to difficulties in effectively utilizing inter-frame relationships and understanding textual descriptions of object temporal variations. To address this issue, we propose Semantic-assisted Object Cluster (SOC), which aggregates video content and textual guidance for unified temporal modeling and cross-modal alignment. By associating a group of frame-level object embeddings with language tokens, SOC facilitates joint space learning across modalities and time steps.

semantic-assisted object cluster, temporal variation, video object segmentation, (4 more...)

Neural Information Processing Systems

Jan-18-2025, 06:41:18 GMT

Conferences Web Page

Add feedback

Technology:
- Information Technology > Artificial Intelligence > Vision (1.00)