Enhancing Video-Language Representations with Structural Spatio-Temporal Alignment