Scalable Mask Annotation for Video Text Spotting