Enhancing Video-Language Representations with Structural Spatio-Temporal Alignment
