Towards Surveillance Video-and-Language Understanding: New Dataset, Baselines, and Challenges
