Supplementary Material for "Embracing Consistency: A One-Stage Approach for Spatio-Temporal Video Grounding" Yang Jin