Supplementary Material for "Embracing Consistency: A One-Stage Approach for Spatio-Temporal Video Grounding" Yang Jin
Additional implementation details are provided in Section 2. Section 3 presents further ablation results on model design and hyper-parameter settings. The detailed computation pipeline of the proposed query-guided decoding is shown in Figure 1. The architecture of the proposed query-guided decoder and prediction head. The proposed model is trained on 32 Nvidia A100 GPUs with 1 video per GPU. The detailed results are shown in Table 1 and Table 2. Finally, detailed ablation results of the temporal interaction layer on the HC-STVG benchmark are provided in Table 3b.
Embracing Consistency: A One-Stage Approach for Spatio-Temporal Video Grounding Yang Jin 1, Yongzhi Li
Spatio-temporal video grounding (STVG) focuses on retrieving the spatio-temporal tube of a specific object depicted by a free-form textual expression. Existing approaches mainly treat this complicated task as a parallel frame-grounding problem and thus suffer from two types of inconsistency: feature alignment inconsistency and prediction inconsistency.
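To make the notion of a spatio-temporal tube concrete, the following is a minimal sketch of one possible representation: a temporal segment plus one bounding box per frame inside that segment. The class name `SpatioTemporalTube`, its fields, and the `(x1, y1, x2, y2)` box format are illustrative assumptions and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# Assumed box format: (x1, y1, x2, y2) corner coordinates.
Box = Tuple[float, float, float, float]


@dataclass
class SpatioTemporalTube:
    """Hypothetical container: a frame interval and one box per frame in it."""

    start_frame: int            # first frame of the segment (inclusive)
    end_frame: int              # last frame of the segment (inclusive)
    boxes: Dict[int, Box]       # frame index -> bounding box

    def __post_init__(self) -> None:
        # A well-formed tube has exactly one box for every frame in the segment.
        expected = set(range(self.start_frame, self.end_frame + 1))
        if set(self.boxes) != expected:
            raise ValueError(
                "a tube needs exactly one box per frame in [start_frame, end_frame]"
            )

    def num_frames(self) -> int:
        return self.end_frame - self.start_frame + 1


# Example: an object tracked over frames 10..12.
tube = SpatioTemporalTube(
    start_frame=10,
    end_frame=12,
    boxes={
        10: (5.0, 5.0, 50.0, 80.0),
        11: (6.0, 5.0, 51.0, 80.0),
        12: (7.0, 6.0, 52.0, 81.0),
    },
)
```

Under this representation, an STVG model takes a video and a textual expression as input and outputs one such tube for the described object.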