Learning Speech Representation From Contrastive Token-Acoustic Pretraining