Towards Surveillance Video-and-Language Understanding: New Dataset, Baselines, and Challenges