Embracing Consistency: A One-Stage Approach for Spatio-Temporal Video Grounding
Yang Jin¹, Yongzhi Li

Neural Information Processing Systems 

Spatio-temporal video grounding (STVG) focuses on retrieving the spatio-temporal tube of a specific object described by a free-form textual expression. Existing approaches mainly treat this complex task as a parallel frame-grounding problem and thus suffer from two types of inconsistency: feature alignment inconsistency and prediction inconsistency.
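To make the terms above concrete, here is a minimal sketch, not the authors' method, of what a spatio-temporal tube is and how a parallel frame-grounding pipeline can produce an inconsistent prediction. It assumes a tube is one contiguous frame span plus exactly one box per frame in that span, and it assumes a hypothetical post-processing step that thresholds each frame's relevance score independently. The names `Tube`, `is_valid`, `segments_from_frame_scores`, the (x1, y1, x2, y2) box convention, and the 0.5 threshold are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Assumed box convention for this sketch: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class Tube:
    """Hypothetical tube: a contiguous frame span plus one box per frame."""
    start: int              # first frame index (inclusive)
    end: int                # last frame index (inclusive)
    boxes: Dict[int, Box]   # frame index -> bounding box

    def is_valid(self) -> bool:
        # Every frame in [start, end] has a box, and no box lies outside it.
        return self.start <= self.end and set(self.boxes) == set(
            range(self.start, self.end + 1)
        )


def segments_from_frame_scores(
    scores: List[float], threshold: float = 0.5
) -> List[Tuple[int, int]]:
    """Hypothetical parallel frame-grounding post-processing: each frame is
    judged independently, then contiguous runs above the threshold are kept."""
    segments: List[Tuple[int, int]] = []
    start: Optional[int] = None
    for t, s in enumerate(scores):
        if s >= threshold and start is None:
            start = t                       # a run of relevant frames begins
        elif s < threshold and start is not None:
            segments.append((start, t - 1))  # the run ends at the previous frame
            start = None
    if start is not None:                    # a run reaches the last frame
        segments.append((start, len(scores) - 1))
    return segments


if __name__ == "__main__":
    # Independent per-frame scores: one dip at frame 3 splits the span in two.
    print(segments_from_frame_scores([0.1, 0.7, 0.8, 0.4, 0.9, 0.6, 0.2]))
    # -> [(1, 2), (4, 5)]
```

Under these assumptions, the usage example yields two fragments where a valid tube needs a single contiguous span. This is one simple way independent per-frame decisions can disagree with each other, which is the kind of prediction inconsistency the abstract attributes to treating STVG as parallel frame grounding.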
