Embracing Consistency: A One-Stage Approach for Spatio-Temporal Video Grounding
Yang Jin¹, Yongzhi Li