Multimodal Alignment with Cross-Attentive GRUs for Fine-Grained Video Understanding