TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment