Supplementary Material for "Embracing Consistency: A One-Stage Approach for Spatio-Temporal Video Grounding" Yang Jin
Neural Information Processing Systems
Additional implementation details are provided in Section 2. Next, Section 3 presents more ablation study results with respect to model design and hyper-parameter settings. The detailed computation pipeline of the proposed query-guided decoding is shown in Figure 1 (the architecture of the proposed query-guided decoder and prediction head). The proposed model is trained on 32 Nvidia A100 GPUs with 1 video per GPU. The detailed results are shown in Table 1 and Table 2. Finally, we provide the detailed ablation results of the temporal interaction layer on the HC-STVG benchmark in Table 3b.