Multi-Scale Temporal Difference Transformer for Video-Text Retrieval