Learning Music Sequence Representation from Text Supervision